commit d7d9103

dalem  ·  2026-02-28 15:16:38 +0000 UTC
parent 2e266a0
Remove /stb

/stb was needed when
everything was handled on
the swc side. Now it happens
within swall, so it's all dead code.
2 files changed,  +0, -23134
+0, -9875
   1@@ -1,9875 +0,0 @@
   2-/* stb_image - v2.30 - public domain image loader - http://nothings.org/stb
   3-                                  no warranty implied; use at your own risk
   4-
   5-   Do this:
   6-      #define STB_IMAGE_IMPLEMENTATION
   7-   before you include this file in *one* C or C++ file to create the
   8-implementation.
   9-
  10-   // i.e. it should look like this:
  11-   #include ...
  12-   #include ...
  13-   #include ...
  14-   #define STB_IMAGE_IMPLEMENTATION
  15-   #include "stb_image.h"
  16-
  17-   You can #define STBI_ASSERT(x) before the #include to avoid using assert.h.
  18-   And #define STBI_MALLOC, STBI_REALLOC, and STBI_FREE to avoid using
  19-malloc,realloc,free
  20-
  21-
  22-   QUICK NOTES:
  23-      Primarily of interest to game developers and other people who can
  24-          avoid problematic images and only need the trivial interface
  25-
  26-      JPEG baseline & progressive (12 bpc/arithmetic not supported, same as
  27-stock IJG lib) PNG 1/2/4/8/16-bit-per-channel
  28-
  29-      TGA (not sure what subset, if a subset)
  30-      BMP non-1bpp, non-RLE
  31-      PSD (composited view only, no extra channels, 8/16 bit-per-channel)
  32-
  33-      GIF (*comp always reports as 4-channel)
  34-      HDR (radiance rgbE format)
  35-      PIC (Softimage PIC)
  36-      PNM (PPM and PGM binary only)
  37-
  38-      Animated GIF still needs a proper API, but here's one way to do it:
  39-          http://gist.github.com/urraka/685d9a6340b26b830d49
  40-
  41-      - decode from memory or through FILE (define STBI_NO_STDIO to remove code)
  42-      - decode from arbitrary I/O callbacks
  43-      - SIMD acceleration on x86/x64 (SSE2) and ARM (NEON)
  44-
  45-   Full documentation under "DOCUMENTATION" below.
  46-
  47-
  48-LICENSE
  49-
  50-  See end of file for license information.
  51-
  52-RECENT REVISION HISTORY:
  53-
  54-      2.30  (2024-05-31) avoid erroneous gcc warning
  55-      2.29  (2023-05-xx) optimizations
  56-      2.28  (2023-01-29) many error fixes, security errors, just tons of stuff
  57-      2.27  (2021-07-11) document stbi_info better, 16-bit PNM support, bug
  58-fixes 2.26  (2020-07-13) many minor fixes 2.25  (2020-02-02) fix warnings 2.24
  59-(2020-02-02) fix warnings; thread-local failure_reason and flip_vertically 2.23
  60-(2019-08-11) fix clang static analysis warning 2.22  (2019-03-04) gif fixes, fix
  61-warnings 2.21  (2019-02-25) fix typo in comment 2.20  (2019-02-07) support utf8
  62-filenames in Windows; fix warnings and platform ifdefs 2.19  (2018-02-11) fix
  63-warning 2.18  (2018-01-30) fix warnings 2.17  (2018-01-29) bugfix, 1-bit BMP,
  64-16-bitness query, fix warnings 2.16  (2017-07-23) all functions have 16-bit
  65-variants; optimizations; bugfixes 2.15  (2017-03-18) fix png-1,2,4; all Imagenet
  66-JPGs; no runtime SSE detection on GCC 2.14  (2017-03-03) remove deprecated
  67-STBI_JPEG_OLD; fixes for Imagenet JPGs 2.13  (2016-12-04) experimental 16-bit
  68-API, only for PNG so far; fixes 2.12  (2016-04-02) fix typo in 2.11 PSD fix that
  69-caused crashes 2.11  (2016-04-02) 16-bit PNGS; enable SSE2 in non-gcc x64
  70-                         RGB-format JPEG; remove white matting in PSD;
  71-                         allocate large structures on the stack;
  72-                         correct channel count for PNG & BMP
  73-      2.10  (2016-01-22) avoid warning introduced in 2.09
  74-      2.09  (2016-01-16) 16-bit TGA; comments in PNM files; STBI_REALLOC_SIZED
  75-
  76-   See end of file for full revision history.
  77-
  78-
  79- ============================    Contributors    =========================
  80-
  81- Image formats                          Extensions, features
  82-    Sean Barrett (jpeg, png, bmp)          Jetro Lauha (stbi_info)
  83-    Nicolas Schulz (hdr, psd)              Martin "SpartanJ" Golini (stbi_info)
  84-    Jonathan Dummer (tga)                  James "moose2000" Brown (iPhone PNG)
  85-    Jean-Marc Lienher (gif)                Ben "Disch" Wenger (io callbacks)
  86-    Tom Seddon (pic)                       Omar Cornut (1/2/4-bit PNG)
  87-    Thatcher Ulrich (psd)                  Nicolas Guillemot (vertical flip)
  88-    Ken Miller (pgm, ppm)                  Richard Mitton (16-bit PSD)
  89-    github:urraka (animated gif)           Junggon Kim (PNM comments)
  90-    Christopher Forseth (animated gif)     Daniel Gibson (16-bit TGA)
  91-                                           socks-the-fox (16-bit PNG)
  92-                                           Jeremy Sawicki (handle all ImageNet
  93-JPGs) Optimizations & bugfixes                  Mikhail Morozov (1-bit BMP)
  94-    Fabian "ryg" Giesen                    Anael Seghezzi (is-16-bit query)
  95-    Arseny Kapoulkine                      Simon Breuss (16-bit PNM)
  96-    John-Mark Allen
  97-    Carmelo J Fdez-Aguera
  98-
  99- Bug & warning fixes
 100-    Marc LeBlanc            David Woo          Guillaume George     Martins
 101-Mozeiko Christpher Lloyd        Jerry Jansson      Joseph Thomson       Blazej
 102-Dariusz Roszkowski Phil Jordan                                Dave Moore Roy
 103-Eltham Hayaki Saito            Nathan Reed        Won Chun Luke Graham Johan
 104-Duparc       Nick Verigakis       the Horde3D community Thomas Ruf Ronny
 105-Chevalier                         github:rlyeh Janez Zemva             John
 106-Bartholomew   Michal Cichon        github:romigrou Jonathan Blow           Ken
 107-Hamada         Tero Hanninen        github:svdijk Eugene Golushkov Laurent
 108-Gomila     Cort Stratton        github:snagar Aruelien Pocheville     Sergio
 109-Gonzalez    Thibault Reuille     github:Zelex Cass Everitt            Ryamond
 110-Barbiero                        github:grim210 Paul Du Bois            Engin
 111-Manap        Aldo Culquicondor    github:sammyhw Philipp Wiesemann       Dale
 112-Weiler        Oriol Ferrer Mesia   github:phprus Josh Tobin              Neil
 113-Bickford      Matthew Gregan       github:poppolopoppo Julian Raschke Gregory
 114-Mullen     Christian Floisand   github:darealshinji Baldur Karlsson Kevin
 115-Schmidt      JR Smith             github:Michaelangel007 Brad Weinberger Matvey
 116-Cherevko      github:mosra Luca Sas                Alexander Veselov  Zack
 117-Middleton       [reserved] Ryan C. Gordon          [reserved] [reserved] DO NOT
 118-ADD YOUR NAME HERE
 119-
 120-                     Jacko Dirks
 121-
 122-  To add your name to the credits, pick a random blank space in the middle and
 123-fill it. 80% of merge conflicts on stb PRs are due to people adding their name
 124-at the end of the credits.
 125-*/
 126-
 127-#ifndef STBI_INCLUDE_STB_IMAGE_H
 128-#define STBI_INCLUDE_STB_IMAGE_H
 129-
 130-// DOCUMENTATION
 131-//
 132-// Limitations:
 133-//    - no 12-bit-per-channel JPEG
 134-//    - no JPEGs with arithmetic coding
 135-//    - GIF always returns *comp=4
 136-//
 137-// Basic usage (see HDR discussion below for HDR usage):
 138-//    int x,y,n;
 139-//    unsigned char *data = stbi_load(filename, &x, &y, &n, 0);
 140-//    // ... process data if not NULL ...
 141-//    // ... x = width, y = height, n = # 8-bit components per pixel ...
 142-//    // ... replace '0' with '1'..'4' to force that many components per pixel
 143-//    // ... but 'n' will always be the number that it would have been if you
 144-//    said 0 stbi_image_free(data);
 145-//
 146-// Standard parameters:
 147-//    int *x                 -- outputs image width in pixels
 148-//    int *y                 -- outputs image height in pixels
 149-//    int *channels_in_file  -- outputs # of image components in image file
 150-//    int desired_channels   -- if non-zero, # of image components requested in
 151-//    result
 152-//
 153-// The return value from an image loader is an 'unsigned char *' which points
 154-// to the pixel data, or NULL on an allocation failure or if the image is
 155-// corrupt or invalid. The pixel data consists of *y scanlines of *x pixels,
 156-// with each pixel consisting of N interleaved 8-bit components; the first
 157-// pixel pointed to is top-left-most in the image. There is no padding between
 158-// image scanlines or between pixels, regardless of format. The number of
 159-// components N is 'desired_channels' if desired_channels is non-zero, or
 160-// *channels_in_file otherwise. If desired_channels is non-zero,
 161-// *channels_in_file has the number of components that _would_ have been
 162-// output otherwise. E.g. if you set desired_channels to 4, you will always
 163-// get RGBA output, but you can check *channels_in_file to see if it's trivially
 164-// opaque because e.g. there were only 3 channels in the source image.
 165-//
 166-// An output image with N components has the following components interleaved
 167-// in this order in each pixel:
 168-//
 169-//     N=#comp     components
 170-//       1           grey
 171-//       2           grey, alpha
 172-//       3           red, green, blue
 173-//       4           red, green, blue, alpha
 174-//
 175-// If image loading fails for any reason, the return value will be NULL,
 176-// and *x, *y, *channels_in_file will be unchanged. The function
 177-// stbi_failure_reason() can be queried for an extremely brief, end-user
 178-// unfriendly explanation of why the load failed. Define STBI_NO_FAILURE_STRINGS
 179-// to avoid compiling these strings at all, and STBI_FAILURE_USERMSG to get
 180-// slightly more user-friendly ones.
 181-//
 182-// Paletted PNG, BMP, GIF, and PIC images are automatically depalettized.
 183-//
 184-// To query the width, height and component count of an image without having to
 185-// decode the full file, you can use the stbi_info family of functions:
 186-//
 187-//   int x,y,n,ok;
 188-//   ok = stbi_info(filename, &x, &y, &n);
 189-//   // returns ok=1 and sets x, y, n if image is a supported format,
 190-//   // 0 otherwise.
 191-//
 192-// Note that stb_image pervasively uses ints in its public API for sizes,
 193-// including sizes of memory buffers. This is now part of the API and thus
 194-// hard to change without causing breakage. As a result, the various image
 195-// loaders all have certain limits on image size; these differ somewhat
 196-// by format but generally boil down to either just under 2GB or just under
 197-// 1GB. When the decoded image would be larger than this, stb_image decoding
 198-// will fail.
 199-//
 200-// Additionally, stb_image will reject image files that have any of their
 201-// dimensions set to a larger value than the configurable STBI_MAX_DIMENSIONS,
 202-// which defaults to 2**24 = 16777216 pixels. Due to the above memory limit,
 203-// the only way to have an image with such dimensions load correctly
 204-// is for it to have a rather extreme aspect ratio. Either way, the
 205-// assumption here is that such larger images are likely to be malformed
 206-// or malicious. If you do need to load an image with individual dimensions
 207-// larger than that, and it still fits in the overall size limit, you can
 208-// #define STBI_MAX_DIMENSIONS on your own to be something larger.
 209-//
 210-// ===========================================================================
 211-//
 212-// UNICODE:
 213-//
 214-//   If compiling for Windows and you wish to use Unicode filenames, compile
 215-//   with
 216-//       #define STBI_WINDOWS_UTF8
 217-//   and pass utf8-encoded filenames. Call stbi_convert_wchar_to_utf8 to convert
 218-//   Windows wchar_t filenames to utf8.
 219-//
 220-// ===========================================================================
 221-//
 222-// Philosophy
 223-//
 224-// stb libraries are designed with the following priorities:
 225-//
 226-//    1. easy to use
 227-//    2. easy to maintain
 228-//    3. good performance
 229-//
 230-// Sometimes I let "good performance" creep up in priority over "easy to
 231-// maintain", and for best performance I may provide less-easy-to-use APIs that
 232-// give higher performance, in addition to the easy-to-use ones. Nevertheless,
 233-// it's important to keep in mind that from the standpoint of you, a client of
 234-// this library, all you care about is #1 and #3, and stb libraries DO NOT
 235-// emphasize #3 above all.
 236-//
 237-// Some secondary priorities arise directly from the first two, some of which
 238-// provide more explicit reasons why performance can't be emphasized.
 239-//
 240-//    - Portable ("ease of use")
 241-//    - Small source code footprint ("easy to maintain")
 242-//    - No dependencies ("ease of use")
 243-//
 244-// ===========================================================================
 245-//
 246-// I/O callbacks
 247-//
 248-// I/O callbacks allow you to read from arbitrary sources, like packaged
 249-// files or some other source. Data read from callbacks are processed
 250-// through a small internal buffer (currently 128 bytes) to try to reduce
 251-// overhead.
 252-//
 253-// The three functions you must define are "read" (reads some bytes of data),
 254-// "skip" (skips some bytes of data), "eof" (reports if the stream is at the
 255-// end).
 256-//
 257-// ===========================================================================
 258-//
 259-// SIMD support
 260-//
 261-// The JPEG decoder will try to automatically use SIMD kernels on x86 when
 262-// supported by the compiler. For ARM Neon support, you must explicitly
 263-// request it.
 264-//
 265-// (The old do-it-yourself SIMD API is no longer supported in the current
 266-// code.)
 267-//
 268-// On x86, SSE2 will automatically be used when available based on a run-time
 269-// test; if not, the generic C versions are used as a fall-back. On ARM targets,
 270-// the typical path is to have separate builds for NEON and non-NEON devices
 271-// (at least this is true for iOS and Android). Therefore, the NEON support is
 272-// toggled by a build flag: define STBI_NEON to get NEON loops.
 273-//
 274-// If for some reason you do not want to use any of SIMD code, or if
 275-// you have issues compiling it, you can disable it entirely by
 276-// defining STBI_NO_SIMD.
 277-//
 278-// ===========================================================================
 279-//
 280-// HDR image support   (disable by defining STBI_NO_HDR)
 281-//
 282-// stb_image supports loading HDR images in general, and currently the Radiance
 283-// .HDR file format specifically. You can still load any file through the
 284-// existing interface; if you attempt to load an HDR file, it will be
 285-// automatically remapped to LDR, assuming gamma 2.2 and an arbitrary scale
 286-// factor defaulting to 1; both of these constants can be reconfigured through
 287-// this interface:
 288-//
 289-//     stbi_hdr_to_ldr_gamma(2.2f);
 290-//     stbi_hdr_to_ldr_scale(1.0f);
 291-//
 292-// (note, do not use _inverse_ constants; stbi_image will invert them
 293-// appropriately).
 294-//
 295-// Additionally, there is a new, parallel interface for loading files as
 296-// (linear) floats to preserve the full dynamic range:
 297-//
 298-//    float *data = stbi_loadf(filename, &x, &y, &n, 0);
 299-//
 300-// If you load LDR images through this interface, those images will
 301-// be promoted to floating point values, run through the inverse of
 302-// constants corresponding to the above:
 303-//
 304-//     stbi_ldr_to_hdr_scale(1.0f);
 305-//     stbi_ldr_to_hdr_gamma(2.2f);
 306-//
 307-// Finally, given a filename (or an open file or memory block--see header
 308-// file for details) containing image data, you can query for the "most
 309-// appropriate" interface to use (that is, whether the image is HDR or
 310-// not), using:
 311-//
 312-//     stbi_is_hdr(char *filename);
 313-//
 314-// ===========================================================================
 315-//
 316-// iPhone PNG support:
 317-//
 318-// We optionally support converting iPhone-formatted PNGs (which store
 319-// premultiplied BGRA) back to RGB, even though they're internally encoded
 320-// differently. To enable this conversion, call
 321-// stbi_convert_iphone_png_to_rgb(1).
 322-//
 323-// Call stbi_set_unpremultiply_on_load(1) as well to force a divide per
 324-// pixel to remove any premultiplied alpha *only* if the image file explicitly
 325-// says there's premultiplied data (currently only happens in iPhone images,
 326-// and only if iPhone convert-to-rgb processing is on).
 327-//
 328-// ===========================================================================
 329-//
 330-// ADDITIONAL CONFIGURATION
 331-//
 332-//  - You can suppress implementation of any of the decoders to reduce
 333-//    your code footprint by #defining one or more of the following
 334-//    symbols before creating the implementation.
 335-//
 336-//        STBI_NO_JPEG
 337-//        STBI_NO_PNG
 338-//        STBI_NO_BMP
 339-//        STBI_NO_PSD
 340-//        STBI_NO_TGA
 341-//        STBI_NO_GIF
 342-//        STBI_NO_HDR
 343-//        STBI_NO_PIC
 344-//        STBI_NO_PNM   (.ppm and .pgm)
 345-//
 346-//  - You can request *only* certain decoders and suppress all other ones
 347-//    (this will be more forward-compatible, as addition of new decoders
 348-//    doesn't require you to disable them explicitly):
 349-//
 350-//        STBI_ONLY_JPEG
 351-//        STBI_ONLY_PNG
 352-//        STBI_ONLY_BMP
 353-//        STBI_ONLY_PSD
 354-//        STBI_ONLY_TGA
 355-//        STBI_ONLY_GIF
 356-//        STBI_ONLY_HDR
 357-//        STBI_ONLY_PIC
 358-//        STBI_ONLY_PNM   (.ppm and .pgm)
 359-//
 360-//   - If you use STBI_NO_PNG (or _ONLY_ without PNG), and you still
 361-//     want the zlib decoder to be available, #define STBI_SUPPORT_ZLIB
 362-//
 363-//  - If you define STBI_MAX_DIMENSIONS, stb_image will reject images greater
 364-//    than that size (in either width or height) without further processing.
 365-//    This is to let programs in the wild set an upper bound to prevent
 366-//    denial-of-service attacks on untrusted data, as one could generate a
 367-//    valid image of gigantic dimensions and force stb_image to allocate a
 368-//    huge block of memory and spend disproportionate time decoding it. By
 369-//    default this is set to (1 << 24), which is 16777216, but that's still
 370-//    very big.
 371-
 372-#ifndef STBI_NO_STDIO
 373-#include <stdio.h>
 374-#endif // STBI_NO_STDIO
 375-
 376-#define STBI_VERSION 1
 377-
 378-enum {
 379-	STBI_default = 0, // only used for desired_channels
 380-
 381-	STBI_grey = 1,
 382-	STBI_grey_alpha = 2,
 383-	STBI_rgb = 3,
 384-	STBI_rgb_alpha = 4
 385-};
 386-
 387-#include <stdlib.h>
 388-typedef unsigned char stbi_uc;
 389-typedef unsigned short stbi_us;
 390-
 391-#ifdef __cplusplus
 392-extern "C" {
 393-#endif
 394-
 395-#ifndef STBIDEF
 396-#ifdef STB_IMAGE_STATIC
 397-#define STBIDEF static
 398-#else
 399-#define STBIDEF extern
 400-#endif
 401-#endif
 402-
 403-//////////////////////////////////////////////////////////////////////////////
 404-//
 405-// PRIMARY API - works on images of any type
 406-//
 407-
 408-//
 409-// load image by filename, open file, or memory buffer
 410-//
 411-
 412-typedef struct {
 413-	int (*read)(void *user, char *data,
 414-	            int size); // fill 'data' with 'size' bytes.  return number of
 415-	                       // bytes actually read
 416-	void (*skip)(void *user, int n); // skip the next 'n' bytes, or 'unget' the
 417-	                                 // last -n bytes if negative
 418-	int (*eof)(void *user); // returns nonzero if we are at end of file/data
 419-} stbi_io_callbacks;
 420-
 421-////////////////////////////////////
 422-//
 423-// 8-bits-per-channel interface
 424-//
 425-
 426-STBIDEF stbi_uc *
 427-stbi_load_from_memory(stbi_uc const *buffer, int len, int *x, int *y,
 428-                      int *channels_in_file, int desired_channels);
 429-STBIDEF stbi_uc *
 430-stbi_load_from_callbacks(stbi_io_callbacks const *clbk, void *user, int *x,
 431-                         int *y, int *channels_in_file, int desired_channels);
 432-
 433-#ifndef STBI_NO_STDIO
 434-STBIDEF stbi_uc *
 435-stbi_load(char const *filename, int *x, int *y, int *channels_in_file,
 436-          int desired_channels);
 437-STBIDEF stbi_uc *
 438-stbi_load_from_file(FILE *f, int *x, int *y, int *channels_in_file,
 439-                    int desired_channels);
 440-// for stbi_load_from_file, file pointer is left pointing immediately after
 441-// image
 442-#endif
 443-
 444-#ifndef STBI_NO_GIF
 445-STBIDEF stbi_uc *
 446-stbi_load_gif_from_memory(stbi_uc const *buffer, int len, int **delays, int *x,
 447-                          int *y, int *z, int *comp, int req_comp);
 448-#endif
 449-
 450-#ifdef STBI_WINDOWS_UTF8
 451-STBIDEF int
 452-stbi_convert_wchar_to_utf8(char *buffer, size_t bufferlen,
 453-                           const wchar_t *input);
 454-#endif
 455-
 456-////////////////////////////////////
 457-//
 458-// 16-bits-per-channel interface
 459-//
 460-
 461-STBIDEF stbi_us *
 462-stbi_load_16_from_memory(stbi_uc const *buffer, int len, int *x, int *y,
 463-                         int *channels_in_file, int desired_channels);
 464-STBIDEF stbi_us *
 465-stbi_load_16_from_callbacks(stbi_io_callbacks const *clbk, void *user, int *x,
 466-                            int *y, int *channels_in_file,
 467-                            int desired_channels);
 468-
 469-#ifndef STBI_NO_STDIO
 470-STBIDEF stbi_us *
 471-stbi_load_16(char const *filename, int *x, int *y, int *channels_in_file,
 472-             int desired_channels);
 473-STBIDEF stbi_us *
 474-stbi_load_from_file_16(FILE *f, int *x, int *y, int *channels_in_file,
 475-                       int desired_channels);
 476-#endif
 477-
 478-////////////////////////////////////
 479-//
 480-// float-per-channel interface
 481-//
 482-#ifndef STBI_NO_LINEAR
 483-STBIDEF float *
 484-stbi_loadf_from_memory(stbi_uc const *buffer, int len, int *x, int *y,
 485-                       int *channels_in_file, int desired_channels);
 486-STBIDEF float *
 487-stbi_loadf_from_callbacks(stbi_io_callbacks const *clbk, void *user, int *x,
 488-                          int *y, int *channels_in_file, int desired_channels);
 489-
 490-#ifndef STBI_NO_STDIO
 491-STBIDEF float *
 492-stbi_loadf(char const *filename, int *x, int *y, int *channels_in_file,
 493-           int desired_channels);
 494-STBIDEF float *
 495-stbi_loadf_from_file(FILE *f, int *x, int *y, int *channels_in_file,
 496-                     int desired_channels);
 497-#endif
 498-#endif
 499-
 500-#ifndef STBI_NO_HDR
 501-STBIDEF void
 502-stbi_hdr_to_ldr_gamma(float gamma);
 503-STBIDEF void
 504-stbi_hdr_to_ldr_scale(float scale);
 505-#endif // STBI_NO_HDR
 506-
 507-#ifndef STBI_NO_LINEAR
 508-STBIDEF void
 509-stbi_ldr_to_hdr_gamma(float gamma);
 510-STBIDEF void
 511-stbi_ldr_to_hdr_scale(float scale);
 512-#endif // STBI_NO_LINEAR
 513-
 514-// stbi_is_hdr is always defined, but always returns false if STBI_NO_HDR
 515-STBIDEF int
 516-stbi_is_hdr_from_callbacks(stbi_io_callbacks const *clbk, void *user);
 517-STBIDEF int
 518-stbi_is_hdr_from_memory(stbi_uc const *buffer, int len);
 519-#ifndef STBI_NO_STDIO
 520-STBIDEF int
 521-stbi_is_hdr(char const *filename);
 522-STBIDEF int
 523-stbi_is_hdr_from_file(FILE *f);
 524-#endif // STBI_NO_STDIO
 525-
 526-// get a VERY brief reason for failure
 527-// on most compilers (and ALL modern mainstream compilers) this is threadsafe
 528-STBIDEF const char *
 529-stbi_failure_reason(void);
 530-
 531-// free the loaded image -- this is just free()
 532-STBIDEF void
 533-stbi_image_free(void *retval_from_stbi_load);
 534-
 535-// get image dimensions & components without fully decoding
 536-STBIDEF int
 537-stbi_info_from_memory(stbi_uc const *buffer, int len, int *x, int *y,
 538-                      int *comp);
 539-STBIDEF int
 540-stbi_info_from_callbacks(stbi_io_callbacks const *clbk, void *user, int *x,
 541-                         int *y, int *comp);
 542-STBIDEF int
 543-stbi_is_16_bit_from_memory(stbi_uc const *buffer, int len);
 544-STBIDEF int
 545-stbi_is_16_bit_from_callbacks(stbi_io_callbacks const *clbk, void *user);
 546-
 547-#ifndef STBI_NO_STDIO
 548-STBIDEF int
 549-stbi_info(char const *filename, int *x, int *y, int *comp);
 550-STBIDEF int
 551-stbi_info_from_file(FILE *f, int *x, int *y, int *comp);
 552-STBIDEF int
 553-stbi_is_16_bit(char const *filename);
 554-STBIDEF int
 555-stbi_is_16_bit_from_file(FILE *f);
 556-#endif
 557-
 558-// for image formats that explicitly notate that they have premultiplied alpha,
 559-// we just return the colors as stored in the file. set this flag to force
 560-// unpremultiplication. results are undefined if the unpremultiply overflow.
 561-STBIDEF void
 562-stbi_set_unpremultiply_on_load(int flag_true_if_should_unpremultiply);
 563-
 564-// indicate whether we should process iphone images back to canonical format,
 565-// or just pass them through "as-is"
 566-STBIDEF void
 567-stbi_convert_iphone_png_to_rgb(int flag_true_if_should_convert);
 568-
 569-// flip the image vertically, so the first pixel in the output array is the
 570-// bottom left
 571-STBIDEF void
 572-stbi_set_flip_vertically_on_load(int flag_true_if_should_flip);
 573-
 574-// as above, but only applies to images loaded on the thread that calls the
 575-// function this function is only available if your compiler supports
 576-// thread-local variables; calling it will fail to link if your compiler doesn't
 577-STBIDEF void
 578-stbi_set_unpremultiply_on_load_thread(int flag_true_if_should_unpremultiply);
 579-STBIDEF void
 580-stbi_convert_iphone_png_to_rgb_thread(int flag_true_if_should_convert);
 581-STBIDEF void
 582-stbi_set_flip_vertically_on_load_thread(int flag_true_if_should_flip);
 583-
 584-// ZLIB client - used by PNG, available for other purposes
 585-
 586-STBIDEF char *
 587-stbi_zlib_decode_malloc_guesssize(const char *buffer, int len, int initial_size,
 588-                                  int *outlen);
 589-STBIDEF char *
 590-stbi_zlib_decode_malloc_guesssize_headerflag(const char *buffer, int len,
 591-                                             int initial_size, int *outlen,
 592-                                             int parse_header);
 593-STBIDEF char *
 594-stbi_zlib_decode_malloc(const char *buffer, int len, int *outlen);
 595-STBIDEF int
 596-stbi_zlib_decode_buffer(char *obuffer, int olen, const char *ibuffer, int ilen);
 597-
 598-STBIDEF char *
 599-stbi_zlib_decode_noheader_malloc(const char *buffer, int len, int *outlen);
 600-STBIDEF int
 601-stbi_zlib_decode_noheader_buffer(char *obuffer, int olen, const char *ibuffer,
 602-                                 int ilen);
 603-
 604-#ifdef __cplusplus
 605-}
 606-#endif
 607-
 608-//
 609-//
 610-////   end header file   /////////////////////////////////////////////////////
 611-#endif // STBI_INCLUDE_STB_IMAGE_H
 612-
 613-#ifdef STB_IMAGE_IMPLEMENTATION
 614-
 615-#if defined(STBI_ONLY_JPEG) || defined(STBI_ONLY_PNG) ||                       \
 616-    defined(STBI_ONLY_BMP) || defined(STBI_ONLY_TGA) ||                        \
 617-    defined(STBI_ONLY_GIF) || defined(STBI_ONLY_PSD) ||                        \
 618-    defined(STBI_ONLY_HDR) || defined(STBI_ONLY_PIC) ||                        \
 619-    defined(STBI_ONLY_PNM) || defined(STBI_ONLY_ZLIB)
 620-#ifndef STBI_ONLY_JPEG
 621-#define STBI_NO_JPEG
 622-#endif
 623-#ifndef STBI_ONLY_PNG
 624-#define STBI_NO_PNG
 625-#endif
 626-#ifndef STBI_ONLY_BMP
 627-#define STBI_NO_BMP
 628-#endif
 629-#ifndef STBI_ONLY_PSD
 630-#define STBI_NO_PSD
 631-#endif
 632-#ifndef STBI_ONLY_TGA
 633-#define STBI_NO_TGA
 634-#endif
 635-#ifndef STBI_ONLY_GIF
 636-#define STBI_NO_GIF
 637-#endif
 638-#ifndef STBI_ONLY_HDR
 639-#define STBI_NO_HDR
 640-#endif
 641-#ifndef STBI_ONLY_PIC
 642-#define STBI_NO_PIC
 643-#endif
 644-#ifndef STBI_ONLY_PNM
 645-#define STBI_NO_PNM
 646-#endif
 647-#endif
 648-
 649-#if defined(STBI_NO_PNG) && !defined(STBI_SUPPORT_ZLIB) &&                     \
 650-    !defined(STBI_NO_ZLIB)
 651-#define STBI_NO_ZLIB
 652-#endif
 653-
 654-#include <limits.h>
 655-#include <stdarg.h>
 656-#include <stddef.h> // ptrdiff_t on osx
 657-#include <stdlib.h>
 658-#include <string.h>
 659-
 660-#if !defined(STBI_NO_LINEAR) || !defined(STBI_NO_HDR)
 661-#include <math.h> // ldexp, pow
 662-#endif
 663-
 664-#ifndef STBI_NO_STDIO
 665-#include <stdio.h>
 666-#endif
 667-
 668-#ifndef STBI_ASSERT
 669-#include <assert.h>
 670-#define STBI_ASSERT(x) assert(x)
 671-#endif
 672-
 673-#ifdef __cplusplus
 674-#define STBI_EXTERN extern "C"
 675-#else
 676-#define STBI_EXTERN extern
 677-#endif
 678-
 679-#ifndef _MSC_VER
 680-#ifdef __cplusplus
 681-#define stbi_inline inline
 682-#else
 683-#define stbi_inline
 684-#endif
 685-#else
 686-#define stbi_inline __forceinline
 687-#endif
 688-
 689-#ifndef STBI_NO_THREAD_LOCALS
 690-#if defined(__cplusplus) && __cplusplus >= 201103L
 691-#define STBI_THREAD_LOCAL thread_local
 692-#elif defined(__GNUC__) && __GNUC__ < 5
 693-#define STBI_THREAD_LOCAL __thread
 694-#elif defined(_MSC_VER)
 695-#define STBI_THREAD_LOCAL __declspec(thread)
 696-#elif defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L &&              \
 697-    !defined(__STDC_NO_THREADS__)
 698-#define STBI_THREAD_LOCAL _Thread_local
 699-#endif
 700-
 701-#ifndef STBI_THREAD_LOCAL
 702-#if defined(__GNUC__)
 703-#define STBI_THREAD_LOCAL __thread
 704-#endif
 705-#endif
 706-#endif
 707-
 708-#if defined(_MSC_VER) || defined(__SYMBIAN32__)
 709-typedef unsigned short stbi__uint16;
 710-typedef signed short stbi__int16;
 711-typedef unsigned int stbi__uint32;
 712-typedef signed int stbi__int32;
 713-#else
 714-#include <stdint.h>
 715-typedef uint16_t stbi__uint16;
 716-typedef int16_t stbi__int16;
 717-typedef uint32_t stbi__uint32;
 718-typedef int32_t stbi__int32;
 719-#endif
 720-
 721-// should produce compiler error if size is wrong
 722-typedef unsigned char validate_uint32[sizeof(stbi__uint32) == 4 ? 1 : -1];
 723-
 724-#ifdef _MSC_VER
 725-#define STBI_NOTUSED(v) (void)(v)
 726-#else
 727-#define STBI_NOTUSED(v) (void)sizeof(v)
 728-#endif
 729-
 730-#ifdef _MSC_VER
 731-#define STBI_HAS_LROTL
 732-#endif
 733-
 734-#ifdef STBI_HAS_LROTL
 735-#define stbi_lrot(x, y) _lrotl(x, y)
 736-#else
 737-#define stbi_lrot(x, y) (((x) << (y)) | ((x) >> (-(y) & 31)))
 738-#endif
 739-
 740-#if defined(STBI_MALLOC) && defined(STBI_FREE) &&                              \
 741-    (defined(STBI_REALLOC) || defined(STBI_REALLOC_SIZED))
 742-// ok
 743-#elif !defined(STBI_MALLOC) && !defined(STBI_FREE) &&                          \
 744-    !defined(STBI_REALLOC) && !defined(STBI_REALLOC_SIZED)
 745-// ok
 746-#else
 747-#error                                                                         \
 748-    "Must define all or none of STBI_MALLOC, STBI_FREE, and STBI_REALLOC (or STBI_REALLOC_SIZED)."
 749-#endif
 750-
 751-#ifndef STBI_MALLOC
 752-#define STBI_MALLOC(sz) malloc(sz)
 753-#define STBI_REALLOC(p, newsz) realloc(p, newsz)
 754-#define STBI_FREE(p) free(p)
 755-#endif
 756-
 757-#ifndef STBI_REALLOC_SIZED
 758-#define STBI_REALLOC_SIZED(p, oldsz, newsz) STBI_REALLOC(p, newsz)
 759-#endif
 760-
 761-// x86/x64 detection
 762-#if defined(__x86_64__) || defined(_M_X64)
 763-#define STBI__X64_TARGET
 764-#elif defined(__i386) || defined(_M_IX86)
 765-#define STBI__X86_TARGET
 766-#endif
 767-
 768-#if defined(__GNUC__) && defined(STBI__X86_TARGET) && !defined(__SSE2__) &&    \
 769-    !defined(STBI_NO_SIMD)
 770-// gcc doesn't support sse2 intrinsics unless you compile with -msse2,
 771-// which in turn means it gets to use SSE2 everywhere. This is unfortunate,
 772-// but previous attempts to provide the SSE2 functions with runtime
 773-// detection caused numerous issues. The way architecture extensions are
 774-// exposed in GCC/Clang is, sadly, not really suited for one-file libs.
 775-// New behavior: if compiled with -msse2, we use SSE2 without any
 776-// detection; if not, we don't use it at all.
 777-#define STBI_NO_SIMD
 778-#endif
 779-
 780-#if defined(__MINGW32__) && defined(STBI__X86_TARGET) &&                       \
 781-    !defined(STBI_MINGW_ENABLE_SSE2) && !defined(STBI_NO_SIMD)
 782-// Note that __MINGW32__ doesn't actually mean 32-bit, so we have to avoid
 783-// STBI__X64_TARGET
 784-//
 785-// 32-bit MinGW wants ESP to be 16-byte aligned, but this is not in the
 786-// Windows ABI and VC++ as well as Windows DLLs don't maintain that invariant.
 787-// As a result, enabling SSE2 on 32-bit MinGW is dangerous when not
 788-// simultaneously enabling "-mstackrealign".
 789-//
 790-// See https://github.com/nothings/stb/issues/81 for more information.
 791-//
 792-// So default to no SSE2 on 32-bit MinGW. If you've read this far and added
 793-// -mstackrealign to your build settings, feel free to #define
 794-// STBI_MINGW_ENABLE_SSE2.
 795-#define STBI_NO_SIMD
 796-#endif
 797-
 798-#if !defined(STBI_NO_SIMD) &&                                                  \
 799-    (defined(STBI__X86_TARGET) || defined(STBI__X64_TARGET))
 800-#define STBI_SSE2
 801-#include <emmintrin.h>
 802-
 803-#ifdef _MSC_VER
 804-
 805-#if _MSC_VER >= 1400 // not VC6
 806-#include <intrin.h>  // __cpuid
 807-static int
 808-stbi__cpuid3(void)
 809-{
 810-	int info[4];
 811-	__cpuid(info, 1);
 812-	return info[3];
 813-}
 814-#else
 815-static int
 816-stbi__cpuid3(void)
 817-{
 818-	int res;
 819-	__asm {
 820-      mov  eax,1
 821-      cpuid
 822-      mov  res,edx
 823-	}
 824-	return res;
 825-}
 826-#endif
 827-
 828-#define STBI_SIMD_ALIGN(type, name) __declspec(align(16)) type name
 829-
 830-#if !defined(STBI_NO_JPEG) && defined(STBI_SSE2)
 831-static int
 832-stbi__sse2_available(void)
 833-{
 834-	int info3 = stbi__cpuid3();
 835-	return ((info3 >> 26) & 1) != 0;
 836-}
 837-#endif
 838-
 839-#else // assume GCC-style if not VC++
 840-#define STBI_SIMD_ALIGN(type, name) type name __attribute__((aligned(16)))
 841-
 842-#if !defined(STBI_NO_JPEG) && defined(STBI_SSE2)
 843-static int
 844-stbi__sse2_available(void)
 845-{
 846-	// If we're even attempting to compile this on GCC/Clang, that means
 847-	// -msse2 is on, which means the compiler is allowed to use SSE2
 848-	// instructions at will, and so are we.
 849-	return 1;
 850-}
 851-#endif
 852-
 853-#endif
 854-#endif
 855-
 856-// ARM NEON
 857-#if defined(STBI_NO_SIMD) && defined(STBI_NEON)
 858-#undef STBI_NEON
 859-#endif
 860-
 861-#ifdef STBI_NEON
 862-#include <arm_neon.h>
 863-#ifdef _MSC_VER
 864-#define STBI_SIMD_ALIGN(type, name) __declspec(align(16)) type name
 865-#else
 866-#define STBI_SIMD_ALIGN(type, name) type name __attribute__((aligned(16)))
 867-#endif
 868-#endif
 869-
 870-#ifndef STBI_SIMD_ALIGN
 871-#define STBI_SIMD_ALIGN(type, name) type name
 872-#endif
 873-
 874-#ifndef STBI_MAX_DIMENSIONS
 875-#define STBI_MAX_DIMENSIONS (1 << 24)
 876-#endif
 877-
 878-///////////////////////////////////////////////
 879-//
 880-//  stbi__context struct and start_xxx functions
 881-
 882-// stbi__context structure is our basic context used by all images, so it
 883-// contains all the IO context, plus some basic image information
 884-typedef struct {
 885-	stbi__uint32 img_x, img_y;
 886-	int img_n, img_out_n;
 887-
 888-	stbi_io_callbacks io;
 889-	void *io_user_data;
 890-
 891-	int read_from_callbacks;
 892-	int buflen;
 893-	stbi_uc buffer_start[128];
 894-	int callback_already_read;
 895-
 896-	stbi_uc *img_buffer, *img_buffer_end;
 897-	stbi_uc *img_buffer_original, *img_buffer_original_end;
 898-} stbi__context;
 899-
 900-static void
 901-stbi__refill_buffer(stbi__context *s);
 902-
 903-// initialize a memory-decode context
 904-static void
 905-stbi__start_mem(stbi__context *s, stbi_uc const *buffer, int len)
 906-{
 907-	s->io.read = NULL;
 908-	s->read_from_callbacks = 0;
 909-	s->callback_already_read = 0;
 910-	s->img_buffer = s->img_buffer_original = (stbi_uc *)buffer;
 911-	s->img_buffer_end = s->img_buffer_original_end = (stbi_uc *)buffer + len;
 912-}
 913-
 914-// initialize a callback-based context
 915-static void
 916-stbi__start_callbacks(stbi__context *s, stbi_io_callbacks *c, void *user)
 917-{
 918-	s->io = *c;
 919-	s->io_user_data = user;
 920-	s->buflen = sizeof(s->buffer_start);
 921-	s->read_from_callbacks = 1;
 922-	s->callback_already_read = 0;
 923-	s->img_buffer = s->img_buffer_original = s->buffer_start;
 924-	stbi__refill_buffer(s);
 925-	s->img_buffer_original_end = s->img_buffer_end;
 926-}
 927-
 928-#ifndef STBI_NO_STDIO
 929-
 930-static int
 931-stbi__stdio_read(void *user, char *data, int size)
 932-{
 933-	return (int)fread(data, 1, size, (FILE *)user);
 934-}
 935-
 936-static void
 937-stbi__stdio_skip(void *user, int n)
 938-{
 939-	int ch;
 940-	fseek((FILE *)user, n, SEEK_CUR);
 941-	ch = fgetc((FILE *)user); /* have to read a byte to reset feof()'s flag */
 942-	if (ch != EOF) {
 943-		ungetc(ch, (FILE *)user); /* push byte back onto stream if valid. */
 944-	}
 945-}
 946-
 947-static int
 948-stbi__stdio_eof(void *user)
 949-{
 950-	return feof((FILE *)user) || ferror((FILE *)user);
 951-}
 952-
 953-static stbi_io_callbacks stbi__stdio_callbacks = {
 954-    stbi__stdio_read,
 955-    stbi__stdio_skip,
 956-    stbi__stdio_eof,
 957-};
 958-
 959-static void
 960-stbi__start_file(stbi__context *s, FILE *f)
 961-{
 962-	stbi__start_callbacks(s, &stbi__stdio_callbacks, (void *)f);
 963-}
 964-
 965-// static void stop_file(stbi__context *s) { }
 966-
 967-#endif // !STBI_NO_STDIO
 968-
 969-static void
 970-stbi__rewind(stbi__context *s)
 971-{
 972-	// conceptually rewind SHOULD rewind to the beginning of the stream,
 973-	// but we just rewind to the beginning of the initial buffer, because
 974-	// we only use it after doing 'test', which only ever looks at at most 92
 975-	// bytes
 976-	s->img_buffer = s->img_buffer_original;
 977-	s->img_buffer_end = s->img_buffer_original_end;
 978-}
 979-
 980-enum { STBI_ORDER_RGB, STBI_ORDER_BGR };
 981-
 982-typedef struct {
 983-	int bits_per_channel;
 984-	int num_channels;
 985-	int channel_order;
 986-} stbi__result_info;
 987-
 988-#ifndef STBI_NO_JPEG
 989-static int
 990-stbi__jpeg_test(stbi__context *s);
 991-static void *
 992-stbi__jpeg_load(stbi__context *s, int *x, int *y, int *comp, int req_comp,
 993-                stbi__result_info *ri);
 994-static int
 995-stbi__jpeg_info(stbi__context *s, int *x, int *y, int *comp);
 996-#endif
 997-
 998-#ifndef STBI_NO_PNG
 999-static int
1000-stbi__png_test(stbi__context *s);
1001-static void *
1002-stbi__png_load(stbi__context *s, int *x, int *y, int *comp, int req_comp,
1003-               stbi__result_info *ri);
1004-static int
1005-stbi__png_info(stbi__context *s, int *x, int *y, int *comp);
1006-static int
1007-stbi__png_is16(stbi__context *s);
1008-#endif
1009-
1010-#ifndef STBI_NO_BMP
1011-static int
1012-stbi__bmp_test(stbi__context *s);
1013-static void *
1014-stbi__bmp_load(stbi__context *s, int *x, int *y, int *comp, int req_comp,
1015-               stbi__result_info *ri);
1016-static int
1017-stbi__bmp_info(stbi__context *s, int *x, int *y, int *comp);
1018-#endif
1019-
1020-#ifndef STBI_NO_TGA
1021-static int
1022-stbi__tga_test(stbi__context *s);
1023-static void *
1024-stbi__tga_load(stbi__context *s, int *x, int *y, int *comp, int req_comp,
1025-               stbi__result_info *ri);
1026-static int
1027-stbi__tga_info(stbi__context *s, int *x, int *y, int *comp);
1028-#endif
1029-
1030-#ifndef STBI_NO_PSD
1031-static int
1032-stbi__psd_test(stbi__context *s);
1033-static void *
1034-stbi__psd_load(stbi__context *s, int *x, int *y, int *comp, int req_comp,
1035-               stbi__result_info *ri, int bpc);
1036-static int
1037-stbi__psd_info(stbi__context *s, int *x, int *y, int *comp);
1038-static int
1039-stbi__psd_is16(stbi__context *s);
1040-#endif
1041-
1042-#ifndef STBI_NO_HDR
1043-static int
1044-stbi__hdr_test(stbi__context *s);
1045-static float *
1046-stbi__hdr_load(stbi__context *s, int *x, int *y, int *comp, int req_comp,
1047-               stbi__result_info *ri);
1048-static int
1049-stbi__hdr_info(stbi__context *s, int *x, int *y, int *comp);
1050-#endif
1051-
1052-#ifndef STBI_NO_PIC
1053-static int
1054-stbi__pic_test(stbi__context *s);
1055-static void *
1056-stbi__pic_load(stbi__context *s, int *x, int *y, int *comp, int req_comp,
1057-               stbi__result_info *ri);
1058-static int
1059-stbi__pic_info(stbi__context *s, int *x, int *y, int *comp);
1060-#endif
1061-
1062-#ifndef STBI_NO_GIF
1063-static int
1064-stbi__gif_test(stbi__context *s);
1065-static void *
1066-stbi__gif_load(stbi__context *s, int *x, int *y, int *comp, int req_comp,
1067-               stbi__result_info *ri);
1068-static void *
1069-stbi__load_gif_main(stbi__context *s, int **delays, int *x, int *y, int *z,
1070-                    int *comp, int req_comp);
1071-static int
1072-stbi__gif_info(stbi__context *s, int *x, int *y, int *comp);
1073-#endif
1074-
1075-#ifndef STBI_NO_PNM
1076-static int
1077-stbi__pnm_test(stbi__context *s);
1078-static void *
1079-stbi__pnm_load(stbi__context *s, int *x, int *y, int *comp, int req_comp,
1080-               stbi__result_info *ri);
1081-static int
1082-stbi__pnm_info(stbi__context *s, int *x, int *y, int *comp);
1083-static int
1084-stbi__pnm_is16(stbi__context *s);
1085-#endif
1086-
1087-static
1088-#ifdef STBI_THREAD_LOCAL
1089-    STBI_THREAD_LOCAL
1090-#endif
1091-    const char *stbi__g_failure_reason;
1092-
1093-STBIDEF const char *
1094-stbi_failure_reason(void)
1095-{
1096-	return stbi__g_failure_reason;
1097-}
1098-
1099-#ifndef STBI_NO_FAILURE_STRINGS
1100-static int
1101-stbi__err(const char *str)
1102-{
1103-	stbi__g_failure_reason = str;
1104-	return 0;
1105-}
1106-#endif
1107-
1108-static void *
1109-stbi__malloc(size_t size)
1110-{
1111-	return STBI_MALLOC(size);
1112-}
1113-
1114-// stb_image uses ints pervasively, including for offset calculations.
1115-// therefore the largest decoded image size we can support with the
1116-// current code, even on 64-bit targets, is INT_MAX. this is not a
1117-// significant limitation for the intended use case.
1118-//
1119-// we do, however, need to make sure our size calculations don't
1120-// overflow. hence a few helper functions for size calculations that
1121-// multiply integers together, making sure that they're non-negative
1122-// and no overflow occurs.
1123-
1124-// return 1 if the sum is valid, 0 on overflow.
1125-// negative terms are considered invalid.
1126-static int
1127-stbi__addsizes_valid(int a, int b)
1128-{
1129-	if (b < 0) {
1130-		return 0;
1131-	}
1132-	// now 0 <= b <= INT_MAX, hence also
 1133-	// 0 <= INT_MAX - b <= INT_MAX.
1134-	// And "a + b <= INT_MAX" (which might overflow) is the
1135-	// same as a <= INT_MAX - b (no overflow)
1136-	return a <= INT_MAX - b;
1137-}
1138-
1139-// returns 1 if the product is valid, 0 on overflow.
1140-// negative factors are considered invalid.
1141-static int
1142-stbi__mul2sizes_valid(int a, int b)
1143-{
1144-	if (a < 0 || b < 0) {
1145-		return 0;
1146-	}
1147-	if (b == 0) {
1148-		return 1; // mul-by-0 is always safe
1149-	}
1150-	// portable way to check for no overflows in a*b
1151-	return a <= INT_MAX / b;
1152-}
1153-
1154-#if !defined(STBI_NO_JPEG) || !defined(STBI_NO_PNG) ||                         \
1155-    !defined(STBI_NO_TGA) || !defined(STBI_NO_HDR)
1156-// returns 1 if "a*b + add" has no negative terms/factors and doesn't overflow
1157-static int
1158-stbi__mad2sizes_valid(int a, int b, int add)
1159-{
1160-	return stbi__mul2sizes_valid(a, b) && stbi__addsizes_valid(a * b, add);
1161-}
1162-#endif
1163-
1164-// returns 1 if "a*b*c + add" has no negative terms/factors and doesn't overflow
1165-static int
1166-stbi__mad3sizes_valid(int a, int b, int c, int add)
1167-{
1168-	return stbi__mul2sizes_valid(a, b) && stbi__mul2sizes_valid(a * b, c) &&
1169-	       stbi__addsizes_valid(a * b * c, add);
1170-}
1171-
1172-// returns 1 if "a*b*c*d + add" has no negative terms/factors and doesn't
1173-// overflow
1174-#if !defined(STBI_NO_LINEAR) || !defined(STBI_NO_HDR) || !defined(STBI_NO_PNM)
1175-static int
1176-stbi__mad4sizes_valid(int a, int b, int c, int d, int add)
1177-{
1178-	return stbi__mul2sizes_valid(a, b) && stbi__mul2sizes_valid(a * b, c) &&
1179-	       stbi__mul2sizes_valid(a * b * c, d) &&
1180-	       stbi__addsizes_valid(a * b * c * d, add);
1181-}
1182-#endif
1183-
1184-#if !defined(STBI_NO_JPEG) || !defined(STBI_NO_PNG) ||                         \
1185-    !defined(STBI_NO_TGA) || !defined(STBI_NO_HDR)
1186-// mallocs with size overflow checking
1187-static void *
1188-stbi__malloc_mad2(int a, int b, int add)
1189-{
1190-	if (!stbi__mad2sizes_valid(a, b, add)) {
1191-		return NULL;
1192-	}
1193-	return stbi__malloc(a * b + add);
1194-}
1195-#endif
1196-
1197-static void *
1198-stbi__malloc_mad3(int a, int b, int c, int add)
1199-{
1200-	if (!stbi__mad3sizes_valid(a, b, c, add)) {
1201-		return NULL;
1202-	}
1203-	return stbi__malloc(a * b * c + add);
1204-}
1205-
1206-#if !defined(STBI_NO_LINEAR) || !defined(STBI_NO_HDR) || !defined(STBI_NO_PNM)
1207-static void *
1208-stbi__malloc_mad4(int a, int b, int c, int d, int add)
1209-{
1210-	if (!stbi__mad4sizes_valid(a, b, c, d, add)) {
1211-		return NULL;
1212-	}
1213-	return stbi__malloc(a * b * c * d + add);
1214-}
1215-#endif
1216-
1217-// returns 1 if the sum of two signed ints is valid (between -2^31 and 2^31-1
1218-// inclusive), 0 on overflow.
1219-static int
1220-stbi__addints_valid(int a, int b)
1221-{
1222-	if ((a >= 0) != (b >= 0)) {
1223-		return 1; // a and b have different signs, so no overflow
1224-	}
1225-	if (a < 0 && b < 0) {
1226-		return a >= INT_MIN - b; // same as a + b >= INT_MIN; INT_MIN - b cannot
1227-		                         // overflow since b < 0.
1228-	}
1229-	return a <= INT_MAX - b;
1230-}
1231-
1232-// returns 1 if the product of two ints fits in a signed short, 0 on overflow.
1233-static int
1234-stbi__mul2shorts_valid(int a, int b)
1235-{
1236-	if (b == 0 || b == -1) {
1237-		return 1; // multiplication by 0 is always 0; check for -1 so SHRT_MIN/b
1238-		          // doesn't overflow
1239-	}
1240-	if ((a >= 0) == (b >= 0)) {
1241-		return a <= SHRT_MAX /
1242-		                b; // product is positive, so similar to mul2sizes_valid
1243-	}
1244-	if (b < 0) {
1245-		return a <= SHRT_MIN / b; // same as a * b >= SHRT_MIN
1246-	}
1247-	return a >= SHRT_MIN / b;
1248-}
1249-
1250-// stbi__err - error
1251-// stbi__errpf - error returning pointer to float
1252-// stbi__errpuc - error returning pointer to unsigned char
1253-
1254-#ifdef STBI_NO_FAILURE_STRINGS
1255-#define stbi__err(x, y) 0
1256-#elif defined(STBI_FAILURE_USERMSG)
1257-#define stbi__err(x, y) stbi__err(y)
1258-#else
1259-#define stbi__err(x, y) stbi__err(x)
1260-#endif
1261-
1262-#define stbi__errpf(x, y) ((float *)(size_t)(stbi__err(x, y) ? NULL : NULL))
1263-#define stbi__errpuc(x, y)                                                     \
1264-	((unsigned char *)(size_t)(stbi__err(x, y) ? NULL : NULL))
1265-
1266-STBIDEF void
1267-stbi_image_free(void *retval_from_stbi_load)
1268-{
1269-	STBI_FREE(retval_from_stbi_load);
1270-}
1271-
1272-#ifndef STBI_NO_LINEAR
1273-static float *
1274-stbi__ldr_to_hdr(stbi_uc *data, int x, int y, int comp);
1275-#endif
1276-
1277-#ifndef STBI_NO_HDR
1278-static stbi_uc *
1279-stbi__hdr_to_ldr(float *data, int x, int y, int comp);
1280-#endif
1281-
1282-static int stbi__vertically_flip_on_load_global = 0;
1283-
1284-STBIDEF void
1285-stbi_set_flip_vertically_on_load(int flag_true_if_should_flip)
1286-{
1287-	stbi__vertically_flip_on_load_global = flag_true_if_should_flip;
1288-}
1289-
1290-#ifndef STBI_THREAD_LOCAL
1291-#define stbi__vertically_flip_on_load stbi__vertically_flip_on_load_global
1292-#else
1293-static STBI_THREAD_LOCAL int stbi__vertically_flip_on_load_local,
1294-    stbi__vertically_flip_on_load_set;
1295-
1296-STBIDEF void
1297-stbi_set_flip_vertically_on_load_thread(int flag_true_if_should_flip)
1298-{
1299-	stbi__vertically_flip_on_load_local = flag_true_if_should_flip;
1300-	stbi__vertically_flip_on_load_set = 1;
1301-}
1302-
1303-#define stbi__vertically_flip_on_load                                          \
1304-	(stbi__vertically_flip_on_load_set ? stbi__vertically_flip_on_load_local   \
1305-	                                   : stbi__vertically_flip_on_load_global)
1306-#endif // STBI_THREAD_LOCAL
1307-
1308-static void *
1309-stbi__load_main(stbi__context *s, int *x, int *y, int *comp, int req_comp,
1310-                stbi__result_info *ri, int bpc)
1311-{
1312-	memset(ri, 0,
1313-	       sizeof(*ri)); // make sure it's initialized if we add new fields
1314-	ri->bits_per_channel =
1315-	    8; // default is 8 so most paths don't have to be changed
1316-	ri->channel_order =
1317-	    STBI_ORDER_RGB; // all current input & output are this, but this is here
1318-	                    // so we can add BGR order
1319-	ri->num_channels = 0;
1320-
1321-// test the formats with a very explicit header first (at least a FOURCC
1322-// or distinctive magic number first)
1323-#ifndef STBI_NO_PNG
1324-	if (stbi__png_test(s)) {
1325-		return stbi__png_load(s, x, y, comp, req_comp, ri);
1326-	}
1327-#endif
1328-#ifndef STBI_NO_BMP
1329-	if (stbi__bmp_test(s)) {
1330-		return stbi__bmp_load(s, x, y, comp, req_comp, ri);
1331-	}
1332-#endif
1333-#ifndef STBI_NO_GIF
1334-	if (stbi__gif_test(s)) {
1335-		return stbi__gif_load(s, x, y, comp, req_comp, ri);
1336-	}
1337-#endif
1338-#ifndef STBI_NO_PSD
1339-	if (stbi__psd_test(s)) {
1340-		return stbi__psd_load(s, x, y, comp, req_comp, ri, bpc);
1341-	}
1342-#else
1343-	STBI_NOTUSED(bpc);
1344-#endif
1345-#ifndef STBI_NO_PIC
1346-	if (stbi__pic_test(s)) {
1347-		return stbi__pic_load(s, x, y, comp, req_comp, ri);
1348-	}
1349-#endif
1350-
1351-// then the formats that can end up attempting to load with just 1 or 2
1352-// bytes matching expectations; these are prone to false positives, so
1353-// try them later
1354-#ifndef STBI_NO_JPEG
1355-	if (stbi__jpeg_test(s)) {
1356-		return stbi__jpeg_load(s, x, y, comp, req_comp, ri);
1357-	}
1358-#endif
1359-#ifndef STBI_NO_PNM
1360-	if (stbi__pnm_test(s)) {
1361-		return stbi__pnm_load(s, x, y, comp, req_comp, ri);
1362-	}
1363-#endif
1364-
1365-#ifndef STBI_NO_HDR
1366-	if (stbi__hdr_test(s)) {
1367-		float *hdr = stbi__hdr_load(s, x, y, comp, req_comp, ri);
1368-		return stbi__hdr_to_ldr(hdr, *x, *y, req_comp ? req_comp : *comp);
1369-	}
1370-#endif
1371-
1372-#ifndef STBI_NO_TGA
1373-	// test tga last because it's a crappy test!
1374-	if (stbi__tga_test(s)) {
1375-		return stbi__tga_load(s, x, y, comp, req_comp, ri);
1376-	}
1377-#endif
1378-
1379-	return stbi__errpuc("unknown image type",
1380-	                    "Image not of any known type, or corrupt");
1381-}
1382-
1383-static stbi_uc *
1384-stbi__convert_16_to_8(stbi__uint16 *orig, int w, int h, int channels)
1385-{
1386-	int i;
1387-	int img_len = w * h * channels;
1388-	stbi_uc *reduced;
1389-
1390-	reduced = (stbi_uc *)stbi__malloc(img_len);
1391-	if (reduced == NULL) {
1392-		return stbi__errpuc("outofmem", "Out of memory");
1393-	}
1394-
1395-	for (i = 0; i < img_len; ++i) {
 1396-		reduced[i] = (stbi_uc)((orig[i] >> 8) &
 1397-		                       0xFF); // top byte of each 16-bit value is a
 1398-		                              // sufficient approx of 16->8 bit scaling
1399-	}
1400-
1401-	STBI_FREE(orig);
1402-	return reduced;
1403-}
1404-
1405-static stbi__uint16 *
1406-stbi__convert_8_to_16(stbi_uc *orig, int w, int h, int channels)
1407-{
1408-	int i;
1409-	int img_len = w * h * channels;
1410-	stbi__uint16 *enlarged;
1411-
1412-	enlarged = (stbi__uint16 *)stbi__malloc(img_len * 2);
1413-	if (enlarged == NULL) {
1414-		return (stbi__uint16 *)stbi__errpuc("outofmem", "Out of memory");
1415-	}
1416-
1417-	for (i = 0; i < img_len; ++i) {
1418-		enlarged[i] = (stbi__uint16)((orig[i] << 8) +
1419-		                             orig[i]); // replicate to high and low
1420-		                                       // byte, maps 0->0, 255->0xffff
1421-	}
1422-
1423-	STBI_FREE(orig);
1424-	return enlarged;
1425-}
1426-
1427-static void
1428-stbi__vertical_flip(void *image, int w, int h, int bytes_per_pixel)
1429-{
1430-	int row;
1431-	size_t bytes_per_row = (size_t)w * bytes_per_pixel;
1432-	stbi_uc temp[2048];
1433-	stbi_uc *bytes = (stbi_uc *)image;
1434-
1435-	for (row = 0; row < (h >> 1); row++) {
1436-		stbi_uc *row0 = bytes + row * bytes_per_row;
1437-		stbi_uc *row1 = bytes + (h - row - 1) * bytes_per_row;
1438-		// swap row0 with row1
1439-		size_t bytes_left = bytes_per_row;
1440-		while (bytes_left) {
1441-			size_t bytes_copy =
1442-			    (bytes_left < sizeof(temp)) ? bytes_left : sizeof(temp);
1443-			memcpy(temp, row0, bytes_copy);
1444-			memcpy(row0, row1, bytes_copy);
1445-			memcpy(row1, temp, bytes_copy);
1446-			row0 += bytes_copy;
1447-			row1 += bytes_copy;
1448-			bytes_left -= bytes_copy;
1449-		}
1450-	}
1451-}
1452-
1453-#ifndef STBI_NO_GIF
1454-static void
1455-stbi__vertical_flip_slices(void *image, int w, int h, int z,
1456-                           int bytes_per_pixel)
1457-{
1458-	int slice;
1459-	int slice_size = w * h * bytes_per_pixel;
1460-
1461-	stbi_uc *bytes = (stbi_uc *)image;
1462-	for (slice = 0; slice < z; ++slice) {
1463-		stbi__vertical_flip(bytes, w, h, bytes_per_pixel);
1464-		bytes += slice_size;
1465-	}
1466-}
1467-#endif
1468-
1469-static unsigned char *
1470-stbi__load_and_postprocess_8bit(stbi__context *s, int *x, int *y, int *comp,
1471-                                int req_comp)
1472-{
1473-	stbi__result_info ri;
1474-	void *result = stbi__load_main(s, x, y, comp, req_comp, &ri, 8);
1475-
1476-	if (result == NULL) {
1477-		return NULL;
1478-	}
1479-
1480-	// it is the responsibility of the loaders to make sure we get either 8 or
1481-	// 16 bit.
1482-	STBI_ASSERT(ri.bits_per_channel == 8 || ri.bits_per_channel == 16);
1483-
1484-	if (ri.bits_per_channel != 8) {
1485-		result = stbi__convert_16_to_8((stbi__uint16 *)result, *x, *y,
1486-		                               req_comp == 0 ? *comp : req_comp);
1487-		ri.bits_per_channel = 8;
1488-	}
1489-
1490-	// @TODO: move stbi__convert_format to here
1491-
1492-	if (stbi__vertically_flip_on_load) {
1493-		int channels = req_comp ? req_comp : *comp;
1494-		stbi__vertical_flip(result, *x, *y, channels * sizeof(stbi_uc));
1495-	}
1496-
1497-	return (unsigned char *)result;
1498-}
1499-
1500-static stbi__uint16 *
1501-stbi__load_and_postprocess_16bit(stbi__context *s, int *x, int *y, int *comp,
1502-                                 int req_comp)
1503-{
1504-	stbi__result_info ri;
1505-	void *result = stbi__load_main(s, x, y, comp, req_comp, &ri, 16);
1506-
1507-	if (result == NULL) {
1508-		return NULL;
1509-	}
1510-
1511-	// it is the responsibility of the loaders to make sure we get either 8 or
1512-	// 16 bit.
1513-	STBI_ASSERT(ri.bits_per_channel == 8 || ri.bits_per_channel == 16);
1514-
1515-	if (ri.bits_per_channel != 16) {
1516-		result = stbi__convert_8_to_16((stbi_uc *)result, *x, *y,
1517-		                               req_comp == 0 ? *comp : req_comp);
1518-		ri.bits_per_channel = 16;
1519-	}
1520-
1521-	// @TODO: move stbi__convert_format16 to here
1522-	// @TODO: special case RGB-to-Y (and RGBA-to-YA) for 8-bit-to-16-bit case to
1523-	// keep more precision
1524-
1525-	if (stbi__vertically_flip_on_load) {
1526-		int channels = req_comp ? req_comp : *comp;
1527-		stbi__vertical_flip(result, *x, *y, channels * sizeof(stbi__uint16));
1528-	}
1529-
1530-	return (stbi__uint16 *)result;
1531-}
1532-
1533-#if !defined(STBI_NO_HDR) && !defined(STBI_NO_LINEAR)
1534-static void
1535-stbi__float_postprocess(float *result, int *x, int *y, int *comp, int req_comp)
1536-{
1537-	if (stbi__vertically_flip_on_load && result != NULL) {
1538-		int channels = req_comp ? req_comp : *comp;
1539-		stbi__vertical_flip(result, *x, *y, channels * sizeof(float));
1540-	}
1541-}
1542-#endif
1543-
1544-#ifndef STBI_NO_STDIO
1545-
1546-#if defined(_WIN32) && defined(STBI_WINDOWS_UTF8)
1547-STBI_EXTERN __declspec(dllimport) int __stdcall MultiByteToWideChar(
1548-    unsigned int cp, unsigned long flags, const char *str, int cbmb,
1549-    wchar_t *widestr, int cchwide);
1550-STBI_EXTERN __declspec(dllimport) int __stdcall WideCharToMultiByte(
1551-    unsigned int cp, unsigned long flags, const wchar_t *widestr, int cchwide,
1552-    char *str, int cbmb, const char *defchar, int *used_default);
1553-#endif
1554-
1555-#if defined(_WIN32) && defined(STBI_WINDOWS_UTF8)
1556-STBIDEF int
1557-stbi_convert_wchar_to_utf8(char *buffer, size_t bufferlen, const wchar_t *input)
1558-{
1559-	return WideCharToMultiByte(65001 /* UTF8 */, 0, input, -1, buffer,
1560-	                           (int)bufferlen, NULL, NULL);
1561-}
1562-#endif
1563-
1564-static FILE *
1565-stbi__fopen(char const *filename, char const *mode)
1566-{
1567-	FILE *f;
1568-#if defined(_WIN32) && defined(STBI_WINDOWS_UTF8)
1569-	wchar_t wMode[64];
1570-	wchar_t wFilename[1024];
1571-	if (0 == MultiByteToWideChar(65001 /* UTF8 */, 0, filename, -1, wFilename,
1572-	                             sizeof(wFilename) / sizeof(*wFilename))) {
1573-		return 0;
1574-	}
1575-
1576-	if (0 == MultiByteToWideChar(65001 /* UTF8 */, 0, mode, -1, wMode,
1577-	                             sizeof(wMode) / sizeof(*wMode))) {
1578-		return 0;
1579-	}
1580-
1581-#if defined(_MSC_VER) && _MSC_VER >= 1400
1582-	if (0 != _wfopen_s(&f, wFilename, wMode)) {
1583-		f = 0;
1584-	}
1585-#else
1586-	f = _wfopen(wFilename, wMode);
1587-#endif
1588-
1589-#elif defined(_MSC_VER) && _MSC_VER >= 1400
1590-	if (0 != fopen_s(&f, filename, mode)) {
1591-		f = 0;
1592-	}
1593-#else
1594-	f = fopen(filename, mode);
1595-#endif
1596-	return f;
1597-}
1598-
1599-STBIDEF stbi_uc *
1600-stbi_load(char const *filename, int *x, int *y, int *comp, int req_comp)
1601-{
1602-	FILE *f = stbi__fopen(filename, "rb");
1603-	unsigned char *result;
1604-	if (!f) {
1605-		return stbi__errpuc("can't fopen", "Unable to open file");
1606-	}
1607-	result = stbi_load_from_file(f, x, y, comp, req_comp);
1608-	fclose(f);
1609-	return result;
1610-}
1611-
1612-STBIDEF stbi_uc *
1613-stbi_load_from_file(FILE *f, int *x, int *y, int *comp, int req_comp)
1614-{
1615-	unsigned char *result;
1616-	stbi__context s;
1617-	stbi__start_file(&s, f);
1618-	result = stbi__load_and_postprocess_8bit(&s, x, y, comp, req_comp);
1619-	if (result) {
1620-		// need to 'unget' all the characters in the IO buffer
1621-		fseek(f, -(int)(s.img_buffer_end - s.img_buffer), SEEK_CUR);
1622-	}
1623-	return result;
1624-}
1625-
1626-STBIDEF stbi__uint16 *
1627-stbi_load_from_file_16(FILE *f, int *x, int *y, int *comp, int req_comp)
1628-{
1629-	stbi__uint16 *result;
1630-	stbi__context s;
1631-	stbi__start_file(&s, f);
1632-	result = stbi__load_and_postprocess_16bit(&s, x, y, comp, req_comp);
1633-	if (result) {
1634-		// need to 'unget' all the characters in the IO buffer
1635-		fseek(f, -(int)(s.img_buffer_end - s.img_buffer), SEEK_CUR);
1636-	}
1637-	return result;
1638-}
1639-
1640-STBIDEF stbi_us *
1641-stbi_load_16(char const *filename, int *x, int *y, int *comp, int req_comp)
1642-{
1643-	FILE *f = stbi__fopen(filename, "rb");
1644-	stbi__uint16 *result;
1645-	if (!f) {
1646-		return (stbi_us *)stbi__errpuc("can't fopen", "Unable to open file");
1647-	}
1648-	result = stbi_load_from_file_16(f, x, y, comp, req_comp);
1649-	fclose(f);
1650-	return result;
1651-}
1652-
1653-#endif //! STBI_NO_STDIO
1654-
1655-STBIDEF stbi_us *
1656-stbi_load_16_from_memory(stbi_uc const *buffer, int len, int *x, int *y,
1657-                         int *channels_in_file, int desired_channels)
1658-{
1659-	stbi__context s;
1660-	stbi__start_mem(&s, buffer, len);
1661-	return stbi__load_and_postprocess_16bit(&s, x, y, channels_in_file,
1662-	                                        desired_channels);
1663-}
1664-
1665-STBIDEF stbi_us *
1666-stbi_load_16_from_callbacks(stbi_io_callbacks const *clbk, void *user, int *x,
1667-                            int *y, int *channels_in_file, int desired_channels)
1668-{
1669-	stbi__context s;
1670-	stbi__start_callbacks(&s, (stbi_io_callbacks *)clbk, user);
1671-	return stbi__load_and_postprocess_16bit(&s, x, y, channels_in_file,
1672-	                                        desired_channels);
1673-}
1674-
1675-STBIDEF stbi_uc *
1676-stbi_load_from_memory(stbi_uc const *buffer, int len, int *x, int *y, int *comp,
1677-                      int req_comp)
1678-{
1679-	stbi__context s;
1680-	stbi__start_mem(&s, buffer, len);
1681-	return stbi__load_and_postprocess_8bit(&s, x, y, comp, req_comp);
1682-}
1683-
1684-STBIDEF stbi_uc *
1685-stbi_load_from_callbacks(stbi_io_callbacks const *clbk, void *user, int *x,
1686-                         int *y, int *comp, int req_comp)
1687-{
1688-	stbi__context s;
1689-	stbi__start_callbacks(&s, (stbi_io_callbacks *)clbk, user);
1690-	return stbi__load_and_postprocess_8bit(&s, x, y, comp, req_comp);
1691-}
1692-
1693-#ifndef STBI_NO_GIF
1694-STBIDEF stbi_uc *
1695-stbi_load_gif_from_memory(stbi_uc const *buffer, int len, int **delays, int *x,
1696-                          int *y, int *z, int *comp, int req_comp)
1697-{
1698-	unsigned char *result;
1699-	stbi__context s;
1700-	stbi__start_mem(&s, buffer, len);
1701-
1702-	result = (unsigned char *)stbi__load_gif_main(&s, delays, x, y, z, comp,
1703-	                                              req_comp);
1704-	if (stbi__vertically_flip_on_load) {
1705-		stbi__vertical_flip_slices(result, *x, *y, *z, *comp);
1706-	}
1707-
1708-	return result;
1709-}
1710-#endif
1711-
1712-#ifndef STBI_NO_LINEAR
1713-static float *
1714-stbi__loadf_main(stbi__context *s, int *x, int *y, int *comp, int req_comp)
1715-{
1716-	unsigned char *data;
1717-#ifndef STBI_NO_HDR
1718-	if (stbi__hdr_test(s)) {
1719-		stbi__result_info ri;
1720-		float *hdr_data = stbi__hdr_load(s, x, y, comp, req_comp, &ri);
1721-		if (hdr_data) {
1722-			stbi__float_postprocess(hdr_data, x, y, comp, req_comp);
1723-		}
1724-		return hdr_data;
1725-	}
1726-#endif
1727-	data = stbi__load_and_postprocess_8bit(s, x, y, comp, req_comp);
1728-	if (data) {
1729-		return stbi__ldr_to_hdr(data, *x, *y, req_comp ? req_comp : *comp);
1730-	}
1731-	return stbi__errpf("unknown image type",
1732-	                   "Image not of any known type, or corrupt");
1733-}
1734-
1735-STBIDEF float *
1736-stbi_loadf_from_memory(stbi_uc const *buffer, int len, int *x, int *y,
1737-                       int *comp, int req_comp)
1738-{
1739-	stbi__context s;
1740-	stbi__start_mem(&s, buffer, len);
1741-	return stbi__loadf_main(&s, x, y, comp, req_comp);
1742-}
1743-
1744-STBIDEF float *
1745-stbi_loadf_from_callbacks(stbi_io_callbacks const *clbk, void *user, int *x,
1746-                          int *y, int *comp, int req_comp)
1747-{
1748-	stbi__context s;
1749-	stbi__start_callbacks(&s, (stbi_io_callbacks *)clbk, user);
1750-	return stbi__loadf_main(&s, x, y, comp, req_comp);
1751-}
1752-
1753-#ifndef STBI_NO_STDIO
1754-STBIDEF float *
1755-stbi_loadf(char const *filename, int *x, int *y, int *comp, int req_comp)
1756-{
1757-	float *result;
1758-	FILE *f = stbi__fopen(filename, "rb");
1759-	if (!f) {
1760-		return stbi__errpf("can't fopen", "Unable to open file");
1761-	}
1762-	result = stbi_loadf_from_file(f, x, y, comp, req_comp);
1763-	fclose(f);
1764-	return result;
1765-}
1766-
1767-STBIDEF float *
1768-stbi_loadf_from_file(FILE *f, int *x, int *y, int *comp, int req_comp)
1769-{
1770-	stbi__context s;
1771-	stbi__start_file(&s, f);
1772-	return stbi__loadf_main(&s, x, y, comp, req_comp);
1773-}
1774-#endif // !STBI_NO_STDIO
1775-
1776-#endif // !STBI_NO_LINEAR
1777-
1778-// these is-hdr-or-not queries are defined independent of whether
1779-// STBI_NO_LINEAR is defined, for API simplicity; if STBI_NO_HDR is defined,
1780-// they always report false!
1781-
1782-STBIDEF int
1783-stbi_is_hdr_from_memory(stbi_uc const *buffer, int len)
1784-{
1785-#ifndef STBI_NO_HDR
1786-	stbi__context s;
1787-	stbi__start_mem(&s, buffer, len);
1788-	return stbi__hdr_test(&s);
1789-#else
1790-	STBI_NOTUSED(buffer);
1791-	STBI_NOTUSED(len);
1792-	return 0;
1793-#endif
1794-}
1795-
1796-#ifndef STBI_NO_STDIO
1797-STBIDEF int
1798-stbi_is_hdr(char const *filename)
1799-{
1800-	FILE *f = stbi__fopen(filename, "rb");
1801-	int result = 0;
1802-	if (f) {
1803-		result = stbi_is_hdr_from_file(f);
1804-		fclose(f);
1805-	}
1806-	return result;
1807-}
1808-
1809-STBIDEF int
1810-stbi_is_hdr_from_file(FILE *f)
1811-{
1812-#ifndef STBI_NO_HDR
1813-	long pos = ftell(f);
1814-	int res;
1815-	stbi__context s;
1816-	stbi__start_file(&s, f);
1817-	res = stbi__hdr_test(&s);
1818-	fseek(f, pos, SEEK_SET);
1819-	return res;
1820-#else
1821-	STBI_NOTUSED(f);
1822-	return 0;
1823-#endif
1824-}
1825-#endif // !STBI_NO_STDIO
1826-
1827-STBIDEF int
1828-stbi_is_hdr_from_callbacks(stbi_io_callbacks const *clbk, void *user)
1829-{
1830-#ifndef STBI_NO_HDR
1831-	stbi__context s;
1832-	stbi__start_callbacks(&s, (stbi_io_callbacks *)clbk, user);
1833-	return stbi__hdr_test(&s);
1834-#else
1835-	STBI_NOTUSED(clbk);
1836-	STBI_NOTUSED(user);
1837-	return 0;
1838-#endif
1839-}
1840-
1841-#ifndef STBI_NO_LINEAR
1842-static float stbi__l2h_gamma = 2.2f, stbi__l2h_scale = 1.0f;
1843-
1844-STBIDEF void
1845-stbi_ldr_to_hdr_gamma(float gamma)
1846-{
1847-	stbi__l2h_gamma = gamma;
1848-}
1849-STBIDEF void
1850-stbi_ldr_to_hdr_scale(float scale)
1851-{
1852-	stbi__l2h_scale = scale;
1853-}
1854-#endif
1855-
1856-static float stbi__h2l_gamma_i = 1.0f / 2.2f, stbi__h2l_scale_i = 1.0f;
1857-
1858-STBIDEF void
1859-stbi_hdr_to_ldr_gamma(float gamma)
1860-{
1861-	stbi__h2l_gamma_i = 1 / gamma;
1862-}
1863-STBIDEF void
1864-stbi_hdr_to_ldr_scale(float scale)
1865-{
1866-	stbi__h2l_scale_i = 1 / scale;
1867-}
1868-
1869-//////////////////////////////////////////////////////////////////////////////
1870-//
1871-// Common code used by all image loaders
1872-//
1873-
1874-enum { STBI__SCAN_load = 0, STBI__SCAN_type, STBI__SCAN_header };
1875-
1876-static void
1877-stbi__refill_buffer(stbi__context *s)
1878-{
1879-	int n = (s->io.read)(s->io_user_data, (char *)s->buffer_start, s->buflen);
1880-	s->callback_already_read += (int)(s->img_buffer - s->img_buffer_original);
1881-	if (n == 0) {
1882-		// at end of file, treat same as if from memory, but need to handle case
1883-		// where s->img_buffer isn't pointing to safe memory, e.g. 0-byte file
1884-		s->read_from_callbacks = 0;
1885-		s->img_buffer = s->buffer_start;
1886-		s->img_buffer_end = s->buffer_start + 1;
1887-		*s->img_buffer = 0;
1888-	} else {
1889-		s->img_buffer = s->buffer_start;
1890-		s->img_buffer_end = s->buffer_start + n;
1891-	}
1892-}
1893-
1894-stbi_inline static stbi_uc
1895-stbi__get8(stbi__context *s)
1896-{
1897-	if (s->img_buffer < s->img_buffer_end) {
1898-		return *s->img_buffer++;
1899-	}
1900-	if (s->read_from_callbacks) {
1901-		stbi__refill_buffer(s);
1902-		return *s->img_buffer++;
1903-	}
1904-	return 0;
1905-}
1906-
1907-#if defined(STBI_NO_JPEG) && defined(STBI_NO_HDR) && defined(STBI_NO_PIC) &&   \
1908-    defined(STBI_NO_PNM)
1909-// nothing
1910-#else
1911-stbi_inline static int
1912-stbi__at_eof(stbi__context *s)
1913-{
1914-	if (s->io.read) {
1915-		if (!(s->io.eof)(s->io_user_data)) {
1916-			return 0;
1917-		}
1918-		// if feof() is true, check if buffer = end
1919-		// special case: we've only got the special 0 character at the end
1920-		if (s->read_from_callbacks == 0) {
1921-			return 1;
1922-		}
1923-	}
1924-
1925-	return s->img_buffer >= s->img_buffer_end;
1926-}
1927-#endif
1928-
1929-#if defined(STBI_NO_JPEG) && defined(STBI_NO_PNG) && defined(STBI_NO_BMP) &&   \
1930-    defined(STBI_NO_PSD) && defined(STBI_NO_TGA) && defined(STBI_NO_GIF) &&    \
1931-    defined(STBI_NO_PIC)
1932-// nothing
1933-#else
1934-static void
1935-stbi__skip(stbi__context *s, int n)
1936-{
1937-	if (n == 0) {
1938-		return; // already there!
1939-	}
1940-	if (n < 0) {
1941-		s->img_buffer = s->img_buffer_end;
1942-		return;
1943-	}
1944-	if (s->io.read) {
1945-		int blen = (int)(s->img_buffer_end - s->img_buffer);
1946-		if (blen < n) {
1947-			s->img_buffer = s->img_buffer_end;
1948-			(s->io.skip)(s->io_user_data, n - blen);
1949-			return;
1950-		}
1951-	}
1952-	s->img_buffer += n;
1953-}
1954-#endif
1955-
1956-#if defined(STBI_NO_PNG) && defined(STBI_NO_TGA) && defined(STBI_NO_HDR) &&    \
1957-    defined(STBI_NO_PNM)
1958-// nothing
1959-#else
1960-static int
1961-stbi__getn(stbi__context *s, stbi_uc *buffer, int n)
1962-{
1963-	if (s->io.read) {
1964-		int blen = (int)(s->img_buffer_end - s->img_buffer);
1965-		if (blen < n) {
1966-			int res, count;
1967-
1968-			memcpy(buffer, s->img_buffer, blen);
1969-
1970-			count =
1971-			    (s->io.read)(s->io_user_data, (char *)buffer + blen, n - blen);
1972-			res = (count == (n - blen));
1973-			s->img_buffer = s->img_buffer_end;
1974-			return res;
1975-		}
1976-	}
1977-
1978-	if (s->img_buffer + n <= s->img_buffer_end) {
1979-		memcpy(buffer, s->img_buffer, n);
1980-		s->img_buffer += n;
1981-		return 1;
1982-	} else {
1983-		return 0;
1984-	}
1985-}
1986-#endif
1987-
1988-#if defined(STBI_NO_JPEG) && defined(STBI_NO_PNG) && defined(STBI_NO_PSD) &&   \
1989-    defined(STBI_NO_PIC)
1990-// nothing
1991-#else
1992-static int
1993-stbi__get16be(stbi__context *s)
1994-{
1995-	int z = stbi__get8(s);
1996-	return (z << 8) + stbi__get8(s);
1997-}
1998-#endif
1999-
2000-#if defined(STBI_NO_PNG) && defined(STBI_NO_PSD) && defined(STBI_NO_PIC)
2001-// nothing
2002-#else
2003-static stbi__uint32
2004-stbi__get32be(stbi__context *s)
2005-{
2006-	stbi__uint32 z = stbi__get16be(s);
2007-	return (z << 16) + stbi__get16be(s);
2008-}
2009-#endif
2010-
2011-#if defined(STBI_NO_BMP) && defined(STBI_NO_TGA) && defined(STBI_NO_GIF)
2012-// nothing
2013-#else
2014-static int
2015-stbi__get16le(stbi__context *s)
2016-{
2017-	int z = stbi__get8(s);
2018-	return z + (stbi__get8(s) << 8);
2019-}
2020-#endif
2021-
2022-#ifndef STBI_NO_BMP
2023-static stbi__uint32
2024-stbi__get32le(stbi__context *s)
2025-{
2026-	stbi__uint32 z = stbi__get16le(s);
2027-	z += (stbi__uint32)stbi__get16le(s) << 16;
2028-	return z;
2029-}
2030-#endif
2031-
2032-#define STBI__BYTECAST(x)                                                      \
2033-	((stbi_uc)((x) & 255)) // truncate int to byte without warnings
2034-
2035-#if defined(STBI_NO_JPEG) && defined(STBI_NO_PNG) && defined(STBI_NO_BMP) &&   \
2036-    defined(STBI_NO_PSD) && defined(STBI_NO_TGA) && defined(STBI_NO_GIF) &&    \
2037-    defined(STBI_NO_PIC) && defined(STBI_NO_PNM)
2038-// nothing
2039-#else
2040-//////////////////////////////////////////////////////////////////////////////
2041-//
2042-//  generic converter from built-in img_n to req_comp
2043-//    individual types do this automatically as much as possible (e.g. jpeg
2044-//    does all cases internally since it needs to colorspace convert anyway,
2045-//    and it never has alpha, so very few cases). png can automatically
2046-//    interleave an alpha=255 channel, but falls back to this for other cases
2047-//
2048-//  assume data buffer is malloced, so malloc a new one and free that one
2049-//  only failure mode is malloc failing
2050-
2051-static stbi_uc
2052-stbi__compute_y(int r, int g, int b)
2053-{
2054-	return (stbi_uc)(((r * 77) + (g * 150) + (29 * b)) >> 8);
2055-}
2056-#endif
2057-
2058-#if defined(STBI_NO_PNG) && defined(STBI_NO_BMP) && defined(STBI_NO_PSD) &&    \
2059-    defined(STBI_NO_TGA) && defined(STBI_NO_GIF) && defined(STBI_NO_PIC) &&    \
2060-    defined(STBI_NO_PNM)
2061-// nothing
2062-#else
2063-static unsigned char *
2064-stbi__convert_format(unsigned char *data, int img_n, int req_comp,
2065-                     unsigned int x, unsigned int y)
2066-{
2067-	int i, j;
2068-	unsigned char *good;
2069-
2070-	if (req_comp == img_n) {
2071-		return data;
2072-	}
2073-	STBI_ASSERT(req_comp >= 1 && req_comp <= 4);
2074-
2075-	good = (unsigned char *)stbi__malloc_mad3(req_comp, x, y, 0);
2076-	if (good == NULL) {
2077-		STBI_FREE(data);
2078-		return stbi__errpuc("outofmem", "Out of memory");
2079-	}
2080-
2081-	for (j = 0; j < (int)y; ++j) {
2082-		unsigned char *src = data + j * x * img_n;
2083-		unsigned char *dest = good + j * x * req_comp;
2084-
2085-#define STBI__COMBO(a, b) ((a) * 8 + (b))
2086-#define STBI__CASE(a, b)                                                       \
2087-	case STBI__COMBO(a, b):                                                    \
2088-		for (i = x - 1; i >= 0; --i, src += a, dest += b)
2089-		// convert source image with img_n components to one with req_comp
2090-		// components; avoid switch per pixel, so use switch per scanline and
2091-		// massive macros
2092-		switch (STBI__COMBO(img_n, req_comp)) {
2093-			STBI__CASE(1, 2)
2094-			{
2095-				dest[0] = src[0];
2096-				dest[1] = 255;
2097-			}
2098-			break;
2099-			STBI__CASE(1, 3) { dest[0] = dest[1] = dest[2] = src[0]; }
2100-			break;
2101-			STBI__CASE(1, 4)
2102-			{
2103-				dest[0] = dest[1] = dest[2] = src[0];
2104-				dest[3] = 255;
2105-			}
2106-			break;
2107-			STBI__CASE(2, 1) { dest[0] = src[0]; }
2108-			break;
2109-			STBI__CASE(2, 3) { dest[0] = dest[1] = dest[2] = src[0]; }
2110-			break;
2111-			STBI__CASE(2, 4)
2112-			{
2113-				dest[0] = dest[1] = dest[2] = src[0];
2114-				dest[3] = src[1];
2115-			}
2116-			break;
2117-			STBI__CASE(3, 4)
2118-			{
2119-				dest[0] = src[0];
2120-				dest[1] = src[1];
2121-				dest[2] = src[2];
2122-				dest[3] = 255;
2123-			}
2124-			break;
2125-			STBI__CASE(3, 1)
2126-			{
2127-				dest[0] = stbi__compute_y(src[0], src[1], src[2]);
2128-			}
2129-			break;
2130-			STBI__CASE(3, 2)
2131-			{
2132-				dest[0] = stbi__compute_y(src[0], src[1], src[2]);
2133-				dest[1] = 255;
2134-			}
2135-			break;
2136-			STBI__CASE(4, 1)
2137-			{
2138-				dest[0] = stbi__compute_y(src[0], src[1], src[2]);
2139-			}
2140-			break;
2141-			STBI__CASE(4, 2)
2142-			{
2143-				dest[0] = stbi__compute_y(src[0], src[1], src[2]);
2144-				dest[1] = src[3];
2145-			}
2146-			break;
2147-			STBI__CASE(4, 3)
2148-			{
2149-				dest[0] = src[0];
2150-				dest[1] = src[1];
2151-				dest[2] = src[2];
2152-			}
2153-			break;
2154-		default:
2155-			STBI_ASSERT(0);
2156-			STBI_FREE(data);
2157-			STBI_FREE(good);
2158-			return stbi__errpuc("unsupported", "Unsupported format conversion");
2159-		}
2160-#undef STBI__CASE
2161-	}
2162-
2163-	STBI_FREE(data);
2164-	return good;
2165-}
2166-#endif
2167-
2168-#if defined(STBI_NO_PNG) && defined(STBI_NO_PSD)
2169-// nothing
2170-#else
2171-static stbi__uint16
2172-stbi__compute_y_16(int r, int g, int b)
2173-{
2174-	return (stbi__uint16)(((r * 77) + (g * 150) + (29 * b)) >> 8);
2175-}
2176-#endif
2177-
2178-#if defined(STBI_NO_PNG) && defined(STBI_NO_PSD)
2179-// nothing
2180-#else
2181-static stbi__uint16 *
2182-stbi__convert_format16(stbi__uint16 *data, int img_n, int req_comp,
2183-                       unsigned int x, unsigned int y)
2184-{
2185-	int i, j;
2186-	stbi__uint16 *good;
2187-
2188-	if (req_comp == img_n) {
2189-		return data;
2190-	}
2191-	STBI_ASSERT(req_comp >= 1 && req_comp <= 4);
2192-
2193-	good = (stbi__uint16 *)stbi__malloc(req_comp * x * y * 2);
2194-	if (good == NULL) {
2195-		STBI_FREE(data);
2196-		return (stbi__uint16 *)stbi__errpuc("outofmem", "Out of memory");
2197-	}
2198-
2199-	for (j = 0; j < (int)y; ++j) {
2200-		stbi__uint16 *src = data + j * x * img_n;
2201-		stbi__uint16 *dest = good + j * x * req_comp;
2202-
2203-#define STBI__COMBO(a, b) ((a) * 8 + (b))
2204-#define STBI__CASE(a, b)                                                       \
2205-	case STBI__COMBO(a, b):                                                    \
2206-		for (i = x - 1; i >= 0; --i, src += a, dest += b)
2207-		// convert source image with img_n components to one with req_comp
2208-		// components; avoid switch per pixel, so use switch per scanline and
2209-		// massive macros
2210-		switch (STBI__COMBO(img_n, req_comp)) {
2211-			STBI__CASE(1, 2)
2212-			{
2213-				dest[0] = src[0];
2214-				dest[1] = 0xffff;
2215-			}
2216-			break;
2217-			STBI__CASE(1, 3) { dest[0] = dest[1] = dest[2] = src[0]; }
2218-			break;
2219-			STBI__CASE(1, 4)
2220-			{
2221-				dest[0] = dest[1] = dest[2] = src[0];
2222-				dest[3] = 0xffff;
2223-			}
2224-			break;
2225-			STBI__CASE(2, 1) { dest[0] = src[0]; }
2226-			break;
2227-			STBI__CASE(2, 3) { dest[0] = dest[1] = dest[2] = src[0]; }
2228-			break;
2229-			STBI__CASE(2, 4)
2230-			{
2231-				dest[0] = dest[1] = dest[2] = src[0];
2232-				dest[3] = src[1];
2233-			}
2234-			break;
2235-			STBI__CASE(3, 4)
2236-			{
2237-				dest[0] = src[0];
2238-				dest[1] = src[1];
2239-				dest[2] = src[2];
2240-				dest[3] = 0xffff;
2241-			}
2242-			break;
2243-			STBI__CASE(3, 1)
2244-			{
2245-				dest[0] = stbi__compute_y_16(src[0], src[1], src[2]);
2246-			}
2247-			break;
2248-			STBI__CASE(3, 2)
2249-			{
2250-				dest[0] = stbi__compute_y_16(src[0], src[1], src[2]);
2251-				dest[1] = 0xffff;
2252-			}
2253-			break;
2254-			STBI__CASE(4, 1)
2255-			{
2256-				dest[0] = stbi__compute_y_16(src[0], src[1], src[2]);
2257-			}
2258-			break;
2259-			STBI__CASE(4, 2)
2260-			{
2261-				dest[0] = stbi__compute_y_16(src[0], src[1], src[2]);
2262-				dest[1] = src[3];
2263-			}
2264-			break;
2265-			STBI__CASE(4, 3)
2266-			{
2267-				dest[0] = src[0];
2268-				dest[1] = src[1];
2269-				dest[2] = src[2];
2270-			}
2271-			break;
2272-		default:
2273-			STBI_ASSERT(0);
2274-			STBI_FREE(data);
2275-			STBI_FREE(good);
2276-			return (stbi__uint16 *)stbi__errpuc(
2277-			    "unsupported", "Unsupported format conversion");
2278-		}
2279-#undef STBI__CASE
2280-	}
2281-
2282-	STBI_FREE(data);
2283-	return good;
2284-}
2285-#endif
2286-
2287-#ifndef STBI_NO_LINEAR
2288-static float *
2289-stbi__ldr_to_hdr(stbi_uc *data, int x, int y, int comp)
2290-{
2291-	int i, k, n;
2292-	float *output;
2293-	if (!data) {
2294-		return NULL;
2295-	}
2296-	output = (float *)stbi__malloc_mad4(x, y, comp, sizeof(float), 0);
2297-	if (output == NULL) {
2298-		STBI_FREE(data);
2299-		return stbi__errpf("outofmem", "Out of memory");
2300-	}
2301-	// compute number of non-alpha components
2302-	if (comp & 1) {
2303-		n = comp;
2304-	} else {
2305-		n = comp - 1;
2306-	}
2307-	for (i = 0; i < x * y; ++i) {
2308-		for (k = 0; k < n; ++k) {
2309-			output[i * comp + k] =
2310-			    (float)(pow(data[i * comp + k] / 255.0f, stbi__l2h_gamma) *
2311-			            stbi__l2h_scale);
2312-		}
2313-	}
2314-	if (n < comp) {
2315-		for (i = 0; i < x * y; ++i) {
2316-			output[i * comp + n] = data[i * comp + n] / 255.0f;
2317-		}
2318-	}
2319-	STBI_FREE(data);
2320-	return output;
2321-}
2322-#endif
2323-
2324-#ifndef STBI_NO_HDR
2325-#define stbi__float2int(x) ((int)(x))
2326-static stbi_uc *
2327-stbi__hdr_to_ldr(float *data, int x, int y, int comp)
2328-{
2329-	int i, k, n;
2330-	stbi_uc *output;
2331-	if (!data) {
2332-		return NULL;
2333-	}
2334-	output = (stbi_uc *)stbi__malloc_mad3(x, y, comp, 0);
2335-	if (output == NULL) {
2336-		STBI_FREE(data);
2337-		return stbi__errpuc("outofmem", "Out of memory");
2338-	}
2339-	// compute number of non-alpha components
2340-	if (comp & 1) {
2341-		n = comp;
2342-	} else {
2343-		n = comp - 1;
2344-	}
2345-	for (i = 0; i < x * y; ++i) {
2346-		for (k = 0; k < n; ++k) {
2347-			float z = (float)pow(data[i * comp + k] * stbi__h2l_scale_i,
2348-			                     stbi__h2l_gamma_i) *
2349-			              255 +
2350-			          0.5f;
2351-			if (z < 0) {
2352-				z = 0;
2353-			}
2354-			if (z > 255) {
2355-				z = 255;
2356-			}
2357-			output[i * comp + k] = (stbi_uc)stbi__float2int(z);
2358-		}
2359-		if (k < comp) {
2360-			float z = data[i * comp + k] * 255 + 0.5f;
2361-			if (z < 0) {
2362-				z = 0;
2363-			}
2364-			if (z > 255) {
2365-				z = 255;
2366-			}
2367-			output[i * comp + k] = (stbi_uc)stbi__float2int(z);
2368-		}
2369-	}
2370-	STBI_FREE(data);
2371-	return output;
2372-}
2373-#endif
2374-
2375-//////////////////////////////////////////////////////////////////////////////
2376-//
2377-//  "baseline" JPEG/JFIF decoder
2378-//
2379-//    simple implementation
2380-//      - doesn't support delayed output of y-dimension
2381-//      - simple interface (only one output format: 8-bit interleaved RGB)
2382-//      - doesn't try to recover corrupt jpegs
2383-//      - doesn't allow partial loading, loading multiple at once
2384-//      - still fast on x86 (copying globals into locals doesn't help x86)
2385-//      - allocates lots of intermediate memory (full size of all components)
2386-//        - non-interleaved case requires this anyway
2387-//        - allows good upsampling (see next)
2388-//    high-quality
2389-//      - upsampled channels are bilinearly interpolated, even across blocks
2390-//      - quality integer IDCT derived from IJG's 'slow'
2391-//    performance
2392-//      - fast huffman; reasonable integer IDCT
2393-//      - some SIMD kernels for common paths on targets with SSE2/NEON
2394-//      - uses a lot of intermediate memory, could cache poorly
2395-
2396-#ifndef STBI_NO_JPEG
2397-
2398-// huffman decoding acceleration
2399-#define FAST_BITS 9 // larger handles more cases; smaller stomps less cache
2400-
2401-typedef struct {
2402-	stbi_uc fast[1 << FAST_BITS];
2403-	// weirdly, repacking this into AoS is a 10% speed loss, instead of a win
2404-	stbi__uint16 code[256];
2405-	stbi_uc values[256];
2406-	stbi_uc size[257];
2407-	unsigned int maxcode[18];
2408-	int delta[17]; // old 'firstsymbol' - old 'firstcode'
2409-} stbi__huffman;
2410-
2411-typedef struct {
2412-	stbi__context *s;
2413-	stbi__huffman huff_dc[4];
2414-	stbi__huffman huff_ac[4];
2415-	stbi__uint16 dequant[4][64];
2416-	stbi__int16 fast_ac[4][1 << FAST_BITS];
2417-
2418-	// sizes for components, interleaved MCUs
2419-	int img_h_max, img_v_max;
2420-	int img_mcu_x, img_mcu_y;
2421-	int img_mcu_w, img_mcu_h;
2422-
2423-	// definition of jpeg image component
2424-	struct {
2425-		int id;
2426-		int h, v;
2427-		int tq;
2428-		int hd, ha;
2429-		int dc_pred;
2430-
2431-		int x, y, w2, h2;
2432-		stbi_uc *data;
2433-		void *raw_data, *raw_coeff;
2434-		stbi_uc *linebuf;
2435-		short *coeff;         // progressive only
2436-		int coeff_w, coeff_h; // number of 8x8 coefficient blocks
2437-	} img_comp[4];
2438-
2439-	stbi__uint32 code_buffer; // jpeg entropy-coded buffer
2440-	int code_bits;            // number of valid bits
2441-	unsigned char marker;     // marker seen while filling entropy buffer
2442-	int nomore;               // flag if we saw a marker so must stop
2443-
2444-	int progressive;
2445-	int spec_start;
2446-	int spec_end;
2447-	int succ_high;
2448-	int succ_low;
2449-	int eob_run;
2450-	int jfif;
2451-	int app14_color_transform; // Adobe APP14 tag
2452-	int rgb;
2453-
2454-	int scan_n, order[4];
2455-	int restart_interval, todo;
2456-
2457-	// kernels
2458-	void (*idct_block_kernel)(stbi_uc *out, int out_stride, short data[64]);
2459-	void (*YCbCr_to_RGB_kernel)(stbi_uc *out, const stbi_uc *y,
2460-	                            const stbi_uc *pcb, const stbi_uc *pcr,
2461-	                            int count, int step);
2462-	stbi_uc *(*resample_row_hv_2_kernel)(stbi_uc *out, stbi_uc *in_near,
2463-	                                     stbi_uc *in_far, int w, int hs);
2464-} stbi__jpeg;
2465-
2466-static int
2467-stbi__build_huffman(stbi__huffman *h, int *count)
2468-{
2469-	int i, j, k = 0;
2470-	unsigned int code;
2471-	// build size list for each symbol (from JPEG spec)
2472-	for (i = 0; i < 16; ++i) {
2473-		for (j = 0; j < count[i]; ++j) {
2474-			h->size[k++] = (stbi_uc)(i + 1);
2475-			if (k >= 257) {
2476-				return stbi__err("bad size list", "Corrupt JPEG");
2477-			}
2478-		}
2479-	}
2480-	h->size[k] = 0;
2481-
2482-	// compute actual symbols (from jpeg spec)
2483-	code = 0;
2484-	k = 0;
2485-	for (j = 1; j <= 16; ++j) {
2486-		// compute delta to add to code to compute symbol id
2487-		h->delta[j] = k - code;
2488-		if (h->size[k] == j) {
2489-			while (h->size[k] == j) {
2490-				h->code[k++] = (stbi__uint16)(code++);
2491-			}
2492-			if (code - 1 >= (1u << j)) {
2493-				return stbi__err("bad code lengths", "Corrupt JPEG");
2494-			}
2495-		}
2496-		// compute largest code + 1 for this size, preshifted as needed later
2497-		h->maxcode[j] = code << (16 - j);
2498-		code <<= 1;
2499-	}
2500-	h->maxcode[j] = 0xffffffff;
2501-
2502-	// build non-spec acceleration table; 255 is flag for not-accelerated
2503-	memset(h->fast, 255, 1 << FAST_BITS);
2504-	for (i = 0; i < k; ++i) {
2505-		int s = h->size[i];
2506-		if (s <= FAST_BITS) {
2507-			int c = h->code[i] << (FAST_BITS - s);
2508-			int m = 1 << (FAST_BITS - s);
2509-			for (j = 0; j < m; ++j) {
2510-				h->fast[c + j] = (stbi_uc)i;
2511-			}
2512-		}
2513-	}
2514-	return 1;
2515-}
2516-
2517-// build a table that decodes both magnitude and value of small ACs in
2518-// one go.
2519-static void
2520-stbi__build_fast_ac(stbi__int16 *fast_ac, stbi__huffman *h)
2521-{
2522-	int i;
2523-	for (i = 0; i < (1 << FAST_BITS); ++i) {
2524-		stbi_uc fast = h->fast[i];
2525-		fast_ac[i] = 0;
2526-		if (fast < 255) {
2527-			int rs = h->values[fast];
2528-			int run = (rs >> 4) & 15;
2529-			int magbits = rs & 15;
2530-			int len = h->size[fast];
2531-
2532-			if (magbits && len + magbits <= FAST_BITS) {
2533-				// magnitude code followed by receive_extend code
2534-				int k = ((i << len) & ((1 << FAST_BITS) - 1)) >>
2535-				        (FAST_BITS - magbits);
2536-				int m = 1 << (magbits - 1);
2537-				if (k < m) {
2538-					k += (~0U << magbits) + 1;
2539-				}
2540-				// if the result is small enough, we can fit it in fast_ac table
2541-				if (k >= -128 && k <= 127) {
2542-					fast_ac[i] =
2543-					    (stbi__int16)((k * 256) + (run * 16) + (len + magbits));
2544-				}
2545-			}
2546-		}
2547-	}
2548-}
2549-
2550-static void
2551-stbi__grow_buffer_unsafe(stbi__jpeg *j)
2552-{
2553-	do {
2554-		unsigned int b = j->nomore ? 0 : stbi__get8(j->s);
2555-		if (b == 0xff) {
2556-			int c = stbi__get8(j->s);
2557-			while (c == 0xff) {
2558-				c = stbi__get8(j->s); // consume fill bytes
2559-			}
2560-			if (c != 0) {
2561-				j->marker = (unsigned char)c;
2562-				j->nomore = 1;
2563-				return;
2564-			}
2565-		}
2566-		j->code_buffer |= b << (24 - j->code_bits);
2567-		j->code_bits += 8;
2568-	} while (j->code_bits <= 24);
2569-}
2570-
2571-// (1 << n) - 1
2572-static const stbi__uint32 stbi__bmask[17] = {
2573-    0,   1,    3,    7,    15,   31,    63,    127,  255,
2574-    511, 1023, 2047, 4095, 8191, 16383, 32767, 65535};
2575-
2576-// decode a jpeg huffman value from the bitstream
2577-stbi_inline static int
2578-stbi__jpeg_huff_decode(stbi__jpeg *j, stbi__huffman *h)
2579-{
2580-	unsigned int temp;
2581-	int c, k;
2582-
2583-	if (j->code_bits < 16) {
2584-		stbi__grow_buffer_unsafe(j);
2585-	}
2586-
2587-	// look at the top FAST_BITS and determine what symbol ID it is,
2588-	// if the code is <= FAST_BITS
2589-	c = (j->code_buffer >> (32 - FAST_BITS)) & ((1 << FAST_BITS) - 1);
2590-	k = h->fast[c];
2591-	if (k < 255) {
2592-		int s = h->size[k];
2593-		if (s > j->code_bits) {
2594-			return -1;
2595-		}
2596-		j->code_buffer <<= s;
2597-		j->code_bits -= s;
2598-		return h->values[k];
2599-	}
2600-
2601-	// naive test is to shift the code_buffer down so k bits are
2602-	// valid, then test against maxcode. To speed this up, we've
2603-	// preshifted maxcode left so that it has (16-k) 0s at the
2604-	// end; in other words, regardless of the number of bits, it
2605-	// wants to be compared against something shifted to have 16;
2606-	// that way we don't need to shift inside the loop.
2607-	temp = j->code_buffer >> 16;
2608-	for (k = FAST_BITS + 1;; ++k) {
2609-		if (temp < h->maxcode[k]) {
2610-			break;
2611-		}
2612-	}
2613-	if (k == 17) {
2614-		// error! code not found
2615-		j->code_bits -= 16;
2616-		return -1;
2617-	}
2618-
2619-	if (k > j->code_bits) {
2620-		return -1;
2621-	}
2622-
2623-	// convert the huffman code to the symbol id
2624-	c = ((j->code_buffer >> (32 - k)) & stbi__bmask[k]) + h->delta[k];
2625-	if (c < 0 || c >= 256) { // symbol id out of bounds!
2626-		return -1;
2627-	}
2628-	STBI_ASSERT((((j->code_buffer) >> (32 - h->size[c])) &
2629-	             stbi__bmask[h->size[c]]) == h->code[c]);
2630-
2631-	// convert the id to a symbol
2632-	j->code_bits -= k;
2633-	j->code_buffer <<= k;
2634-	return h->values[c];
2635-}
2636-
2637-// bias[n] = (-1<<n) + 1
2638-static const int stbi__jbias[16] = {0,     -1,    -3,     -7,    -15,   -31,
2639-                                    -63,   -127,  -255,   -511,  -1023, -2047,
2640-                                    -4095, -8191, -16383, -32767};
2641-
2642-// combined JPEG 'receive' and JPEG 'extend', since baseline
2643-// always extends everything it receives.
2644-stbi_inline static int
2645-stbi__extend_receive(stbi__jpeg *j, int n)
2646-{
2647-	unsigned int k;
2648-	int sgn;
2649-	if (j->code_bits < n) {
2650-		stbi__grow_buffer_unsafe(j);
2651-	}
2652-	if (j->code_bits < n) {
2653-		return 0; // ran out of bits from stream, return 0s intead of continuing
2654-	}
2655-
2656-	sgn = j->code_buffer >> 31; // sign bit always in MSB; 0 if MSB clear
2657-	                            // (positive), 1 if MSB set (negative)
2658-	k = stbi_lrot(j->code_buffer, n);
2659-	j->code_buffer = k & ~stbi__bmask[n];
2660-	k &= stbi__bmask[n];
2661-	j->code_bits -= n;
2662-	return k + (stbi__jbias[n] & (sgn - 1));
2663-}
2664-
2665-// get some unsigned bits
2666-stbi_inline static int
2667-stbi__jpeg_get_bits(stbi__jpeg *j, int n)
2668-{
2669-	unsigned int k;
2670-	if (j->code_bits < n) {
2671-		stbi__grow_buffer_unsafe(j);
2672-	}
2673-	if (j->code_bits < n) {
2674-		return 0; // ran out of bits from stream, return 0s intead of continuing
2675-	}
2676-	k = stbi_lrot(j->code_buffer, n);
2677-	j->code_buffer = k & ~stbi__bmask[n];
2678-	k &= stbi__bmask[n];
2679-	j->code_bits -= n;
2680-	return k;
2681-}
2682-
2683-stbi_inline static int
2684-stbi__jpeg_get_bit(stbi__jpeg *j)
2685-{
2686-	unsigned int k;
2687-	if (j->code_bits < 1) {
2688-		stbi__grow_buffer_unsafe(j);
2689-	}
2690-	if (j->code_bits < 1) {
2691-		return 0; // ran out of bits from stream, return 0s intead of continuing
2692-	}
2693-	k = j->code_buffer;
2694-	j->code_buffer <<= 1;
2695-	--j->code_bits;
2696-	return k & 0x80000000;
2697-}
2698-
2699-// given a value that's at position X in the zigzag stream,
2700-// where does it appear in the 8x8 matrix coded as row-major?
2701-static const stbi_uc stbi__jpeg_dezigzag[64 + 15] = {
2702-    0, 1, 8, 16, 9, 2, 3, 10, 17, 24, 32, 25, 18, 11, 4, 5, 12, 19, 26, 33, 40,
2703-    48, 41, 34, 27, 20, 13, 6, 7, 14, 21, 28, 35, 42, 49, 56, 57, 50, 43, 36,
2704-    29, 22, 15, 23, 30, 37, 44, 51, 58, 59, 52, 45, 38, 31, 39, 46, 53, 60, 61,
2705-    54, 47, 55, 62, 63,
2706-    // let corrupt input sample past end
2707-    63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63};
2708-
1709-// decode one 64-entry block
2710-static int
2711-stbi__jpeg_decode_block(stbi__jpeg *j, short data[64], stbi__huffman *hdc,
2712-                        stbi__huffman *hac, stbi__int16 *fac, int b,
2713-                        stbi__uint16 *dequant)
2714-{
2715-	int diff, dc, k;
2716-	int t;
2717-
2718-	if (j->code_bits < 16) {
2719-		stbi__grow_buffer_unsafe(j);
2720-	}
2721-	t = stbi__jpeg_huff_decode(j, hdc);
2722-	if (t < 0 || t > 15) {
2723-		return stbi__err("bad huffman code", "Corrupt JPEG");
2724-	}
2725-
2726-	// 0 all the ac values now so we can do it 32-bits at a time
2727-	memset(data, 0, 64 * sizeof(data[0]));
2728-
2729-	diff = t ? stbi__extend_receive(j, t) : 0;
2730-	if (!stbi__addints_valid(j->img_comp[b].dc_pred, diff)) {
2731-		return stbi__err("bad delta", "Corrupt JPEG");
2732-	}
2733-	dc = j->img_comp[b].dc_pred + diff;
2734-	j->img_comp[b].dc_pred = dc;
2735-	if (!stbi__mul2shorts_valid(dc, dequant[0])) {
2736-		return stbi__err("can't merge dc and ac", "Corrupt JPEG");
2737-	}
2738-	data[0] = (short)(dc * dequant[0]);
2739-
2740-	// decode AC components, see JPEG spec
2741-	k = 1;
2742-	do {
2743-		unsigned int zig;
2744-		int c, r, s;
2745-		if (j->code_bits < 16) {
2746-			stbi__grow_buffer_unsafe(j);
2747-		}
2748-		c = (j->code_buffer >> (32 - FAST_BITS)) & ((1 << FAST_BITS) - 1);
2749-		r = fac[c];
2750-		if (r) {                // fast-AC path
2751-			k += (r >> 4) & 15; // run
2752-			s = r & 15;         // combined length
2753-			if (s > j->code_bits) {
2754-				return stbi__err(
2755-				    "bad huffman code",
2756-				    "Combined length longer than code bits available");
2757-			}
2758-			j->code_buffer <<= s;
2759-			j->code_bits -= s;
2760-			// decode into unzigzag'd location
2761-			zig = stbi__jpeg_dezigzag[k++];
2762-			data[zig] = (short)((r >> 8) * dequant[zig]);
2763-		} else {
2764-			int rs = stbi__jpeg_huff_decode(j, hac);
2765-			if (rs < 0) {
2766-				return stbi__err("bad huffman code", "Corrupt JPEG");
2767-			}
2768-			s = rs & 15;
2769-			r = rs >> 4;
2770-			if (s == 0) {
2771-				if (rs != 0xf0) {
2772-					break; // end block
2773-				}
2774-				k += 16;
2775-			} else {
2776-				k += r;
2777-				// decode into unzigzag'd location
2778-				zig = stbi__jpeg_dezigzag[k++];
2779-				data[zig] = (short)(stbi__extend_receive(j, s) * dequant[zig]);
2780-			}
2781-		}
2782-	} while (k < 64);
2783-	return 1;
2784-}
2785-
2786-static int
2787-stbi__jpeg_decode_block_prog_dc(stbi__jpeg *j, short data[64],
2788-                                stbi__huffman *hdc, int b)
2789-{
2790-	int diff, dc;
2791-	int t;
2792-	if (j->spec_end != 0) {
2793-		return stbi__err("can't merge dc and ac", "Corrupt JPEG");
2794-	}
2795-
2796-	if (j->code_bits < 16) {
2797-		stbi__grow_buffer_unsafe(j);
2798-	}
2799-
2800-	if (j->succ_high == 0) {
2801-		// first scan for DC coefficient, must be first
2802-		memset(data, 0, 64 * sizeof(data[0])); // 0 all the ac values now
2803-		t = stbi__jpeg_huff_decode(j, hdc);
2804-		if (t < 0 || t > 15) {
2805-			return stbi__err("can't merge dc and ac", "Corrupt JPEG");
2806-		}
2807-		diff = t ? stbi__extend_receive(j, t) : 0;
2808-
2809-		if (!stbi__addints_valid(j->img_comp[b].dc_pred, diff)) {
2810-			return stbi__err("bad delta", "Corrupt JPEG");
2811-		}
2812-		dc = j->img_comp[b].dc_pred + diff;
2813-		j->img_comp[b].dc_pred = dc;
2814-		if (!stbi__mul2shorts_valid(dc, 1 << j->succ_low)) {
2815-			return stbi__err("can't merge dc and ac", "Corrupt JPEG");
2816-		}
2817-		data[0] = (short)(dc * (1 << j->succ_low));
2818-	} else {
2819-		// refinement scan for DC coefficient
2820-		if (stbi__jpeg_get_bit(j)) {
2821-			data[0] += (short)(1 << j->succ_low);
2822-		}
2823-	}
2824-	return 1;
2825-}
2826-
2827-// @OPTIMIZE: store non-zigzagged during the decode passes,
2828-// and only de-zigzag when dequantizing
2829-static int
2830-stbi__jpeg_decode_block_prog_ac(stbi__jpeg *j, short data[64],
2831-                                stbi__huffman *hac, stbi__int16 *fac)
2832-{
2833-	int k;
2834-	if (j->spec_start == 0) {
2835-		return stbi__err("can't merge dc and ac", "Corrupt JPEG");
2836-	}
2837-
2838-	if (j->succ_high == 0) {
2839-		int shift = j->succ_low;
2840-
2841-		if (j->eob_run) {
2842-			--j->eob_run;
2843-			return 1;
2844-		}
2845-
2846-		k = j->spec_start;
2847-		do {
2848-			unsigned int zig;
2849-			int c, r, s;
2850-			if (j->code_bits < 16) {
2851-				stbi__grow_buffer_unsafe(j);
2852-			}
2853-			c = (j->code_buffer >> (32 - FAST_BITS)) & ((1 << FAST_BITS) - 1);
2854-			r = fac[c];
2855-			if (r) {                // fast-AC path
2856-				k += (r >> 4) & 15; // run
2857-				s = r & 15;         // combined length
2858-				if (s > j->code_bits) {
2859-					return stbi__err(
2860-					    "bad huffman code",
2861-					    "Combined length longer than code bits available");
2862-				}
2863-				j->code_buffer <<= s;
2864-				j->code_bits -= s;
2865-				zig = stbi__jpeg_dezigzag[k++];
2866-				data[zig] = (short)((r >> 8) * (1 << shift));
2867-			} else {
2868-				int rs = stbi__jpeg_huff_decode(j, hac);
2869-				if (rs < 0) {
2870-					return stbi__err("bad huffman code", "Corrupt JPEG");
2871-				}
2872-				s = rs & 15;
2873-				r = rs >> 4;
2874-				if (s == 0) {
2875-					if (r < 15) {
2876-						j->eob_run = (1 << r);
2877-						if (r) {
2878-							j->eob_run += stbi__jpeg_get_bits(j, r);
2879-						}
2880-						--j->eob_run;
2881-						break;
2882-					}
2883-					k += 16;
2884-				} else {
2885-					k += r;
2886-					zig = stbi__jpeg_dezigzag[k++];
2887-					data[zig] =
2888-					    (short)(stbi__extend_receive(j, s) * (1 << shift));
2889-				}
2890-			}
2891-		} while (k <= j->spec_end);
2892-	} else {
2893-		// refinement scan for these AC coefficients
2894-
2895-		short bit = (short)(1 << j->succ_low);
2896-
2897-		if (j->eob_run) {
2898-			--j->eob_run;
2899-			for (k = j->spec_start; k <= j->spec_end; ++k) {
2900-				short *p = &data[stbi__jpeg_dezigzag[k]];
2901-				if (*p != 0) {
2902-					if (stbi__jpeg_get_bit(j)) {
2903-						if ((*p & bit) == 0) {
2904-							if (*p > 0) {
2905-								*p += bit;
2906-							} else {
2907-								*p -= bit;
2908-							}
2909-						}
2910-					}
2911-				}
2912-			}
2913-		} else {
2914-			k = j->spec_start;
2915-			do {
2916-				int r, s;
2917-				int rs = stbi__jpeg_huff_decode(
2918-				    j, hac); // @OPTIMIZE see if we can use the fast path here,
2919-				             // advance-by-r is so slow, eh
2920-				if (rs < 0) {
2921-					return stbi__err("bad huffman code", "Corrupt JPEG");
2922-				}
2923-				s = rs & 15;
2924-				r = rs >> 4;
2925-				if (s == 0) {
2926-					if (r < 15) {
2927-						j->eob_run = (1 << r) - 1;
2928-						if (r) {
2929-							j->eob_run += stbi__jpeg_get_bits(j, r);
2930-						}
2931-						r = 64; // force end of block
2932-					} else {
2933-						// r=15 s=0 should write 16 0s, so we just do
2934-						// a run of 15 0s and then write s (which is 0),
2935-						// so we don't have to do anything special here
2936-					}
2937-				} else {
2938-					if (s != 1) {
2939-						return stbi__err("bad huffman code", "Corrupt JPEG");
2940-					}
2941-					// sign bit
2942-					if (stbi__jpeg_get_bit(j)) {
2943-						s = bit;
2944-					} else {
2945-						s = -bit;
2946-					}
2947-				}
2948-
2949-				// advance by r
2950-				while (k <= j->spec_end) {
2951-					short *p = &data[stbi__jpeg_dezigzag[k++]];
2952-					if (*p != 0) {
2953-						if (stbi__jpeg_get_bit(j)) {
2954-							if ((*p & bit) == 0) {
2955-								if (*p > 0) {
2956-									*p += bit;
2957-								} else {
2958-									*p -= bit;
2959-								}
2960-							}
2961-						}
2962-					} else {
2963-						if (r == 0) {
2964-							*p = (short)s;
2965-							break;
2966-						}
2967-						--r;
2968-					}
2969-				}
2970-			} while (k <= j->spec_end);
2971-		}
2972-	}
2973-	return 1;
2974-}
2975-
2976-// clamp an int to the 0..255 range and return it as a stbi_uc
2977-stbi_inline static stbi_uc
2978-stbi__clamp(int x)
2979-{
2980-	// trick to use a single test to catch both cases
2981-	if ((unsigned int)x > 255) {
2982-		if (x < 0) {
2983-			return 0;
2984-		}
2985-		if (x > 255) {
2986-			return 255;
2987-		}
2988-	}
2989-	return (stbi_uc)x;
2990-}
2991-
2992-#define stbi__f2f(x) ((int)(((x) * 4096 + 0.5)))
2993-#define stbi__fsh(x) ((x) * 4096)
2994-
2995-// derived from jidctint -- DCT_ISLOW
2996-#define STBI__IDCT_1D(s0, s1, s2, s3, s4, s5, s6, s7)                          \
2997-	int t0, t1, t2, t3, p1, p2, p3, p4, p5, x0, x1, x2, x3;                    \
2998-	p2 = s2;                                                                   \
2999-	p3 = s6;                                                                   \
3000-	p1 = (p2 + p3) * stbi__f2f(0.5411961f);                                    \
3001-	t2 = p1 + p3 * stbi__f2f(-1.847759065f);                                   \
3002-	t3 = p1 + p2 * stbi__f2f(0.765366865f);                                    \
3003-	p2 = s0;                                                                   \
3004-	p3 = s4;                                                                   \
3005-	t0 = stbi__fsh(p2 + p3);                                                   \
3006-	t1 = stbi__fsh(p2 - p3);                                                   \
3007-	x0 = t0 + t3;                                                              \
3008-	x3 = t0 - t3;                                                              \
3009-	x1 = t1 + t2;                                                              \
3010-	x2 = t1 - t2;                                                              \
3011-	t0 = s7;                                                                   \
3012-	t1 = s5;                                                                   \
3013-	t2 = s3;                                                                   \
3014-	t3 = s1;                                                                   \
3015-	p3 = t0 + t2;                                                              \
3016-	p4 = t1 + t3;                                                              \
3017-	p1 = t0 + t3;                                                              \
3018-	p2 = t1 + t2;                                                              \
3019-	p5 = (p3 + p4) * stbi__f2f(1.175875602f);                                  \
3020-	t0 = t0 * stbi__f2f(0.298631336f);                                         \
3021-	t1 = t1 * stbi__f2f(2.053119869f);                                         \
3022-	t2 = t2 * stbi__f2f(3.072711026f);                                         \
3023-	t3 = t3 * stbi__f2f(1.501321110f);                                         \
3024-	p1 = p5 + p1 * stbi__f2f(-0.899976223f);                                   \
3025-	p2 = p5 + p2 * stbi__f2f(-2.562915447f);                                   \
3026-	p3 = p3 * stbi__f2f(-1.961570560f);                                        \
3027-	p4 = p4 * stbi__f2f(-0.390180644f);                                        \
3028-	t3 += p1 + p4;                                                             \
3029-	t2 += p2 + p3;                                                             \
3030-	t1 += p2 + p4;                                                             \
3031-	t0 += p1 + p3;
3032-
3033-static void
3034-stbi__idct_block(stbi_uc *out, int out_stride, short data[64])
3035-{
3036-	int i, val[64], *v = val;
3037-	stbi_uc *o;
3038-	short *d = data;
3039-
3040-	// columns
3041-	for (i = 0; i < 8; ++i, ++d, ++v) {
3042-		// if all zeroes, shortcut -- this avoids dequantizing 0s and IDCTing
3043-		if (d[8] == 0 && d[16] == 0 && d[24] == 0 && d[32] == 0 && d[40] == 0 &&
3044-		    d[48] == 0 && d[56] == 0) {
3045-			//    no shortcut                 0     seconds
3046-			//    (1|2|3|4|5|6|7)==0          0     seconds
3047-			//    all separate               -0.047 seconds
3048-			//    1 && 2|3 && 4|5 && 6|7:    -0.047 seconds
3049-			int dcterm = d[0] * 4;
3050-			v[0] = v[8] = v[16] = v[24] = v[32] = v[40] = v[48] = v[56] =
3051-			    dcterm;
3052-		} else {
3053-			STBI__IDCT_1D(d[0], d[8], d[16], d[24], d[32], d[40], d[48], d[56])
3054-			// constants scaled things up by 1<<12; let's bring them back
3055-			// down, but keep 2 extra bits of precision
3056-			x0 += 512;
3057-			x1 += 512;
3058-			x2 += 512;
3059-			x3 += 512;
3060-			v[0] = (x0 + t3) >> 10;
3061-			v[56] = (x0 - t3) >> 10;
3062-			v[8] = (x1 + t2) >> 10;
3063-			v[48] = (x1 - t2) >> 10;
3064-			v[16] = (x2 + t1) >> 10;
3065-			v[40] = (x2 - t1) >> 10;
3066-			v[24] = (x3 + t0) >> 10;
3067-			v[32] = (x3 - t0) >> 10;
3068-		}
3069-	}
3070-
3071-	for (i = 0, v = val, o = out; i < 8; ++i, v += 8, o += out_stride) {
3072-		// no fast case since the first 1D IDCT spread components out
3073-		STBI__IDCT_1D(v[0], v[1], v[2], v[3], v[4], v[5], v[6], v[7])
3074-		// constants scaled things up by 1<<12, plus we had 1<<2 from first
3075-		// loop, plus horizontal and vertical each scale by sqrt(8) so together
3076-		// we've got an extra 1<<3, so 1<<17 total we need to remove.
3077-		// so we want to round that, which means adding 0.5 * 1<<17,
3078-		// aka 65536. Also, we'll end up with -128 to 127 that we want
3079-		// to encode as 0..255 by adding 128, so we'll add that before the shift
3080-		x0 += 65536 + (128 << 17);
3081-		x1 += 65536 + (128 << 17);
3082-		x2 += 65536 + (128 << 17);
3083-		x3 += 65536 + (128 << 17);
3084-		// tried computing the shifts into temps, or'ing the temps to see
3085-		// if any were out of range, but that was slower
3086-		o[0] = stbi__clamp((x0 + t3) >> 17);
3087-		o[7] = stbi__clamp((x0 - t3) >> 17);
3088-		o[1] = stbi__clamp((x1 + t2) >> 17);
3089-		o[6] = stbi__clamp((x1 - t2) >> 17);
3090-		o[2] = stbi__clamp((x2 + t1) >> 17);
3091-		o[5] = stbi__clamp((x2 - t1) >> 17);
3092-		o[3] = stbi__clamp((x3 + t0) >> 17);
3093-		o[4] = stbi__clamp((x3 - t0) >> 17);
3094-	}
3095-}
3096-
3097-#ifdef STBI_SSE2
3098-// sse2 integer IDCT. not the fastest possible implementation but it
3099-// produces bit-identical results to the generic C version so it's
3100-// fully "transparent".
3101-static void
3102-stbi__idct_simd(stbi_uc *out, int out_stride, short data[64])
3103-{
3104-	// This is constructed to match our regular (generic) integer IDCT exactly.
3105-	__m128i row0, row1, row2, row3, row4, row5, row6, row7;
3106-	__m128i tmp;
3107-
3108-// dot product constant: even elems=x, odd elems=y
3109-#define dct_const(x, y) _mm_setr_epi16((x), (y), (x), (y), (x), (y), (x), (y))
3110-
3111-// out(0) = c0[even]*x + c0[odd]*y   (c0, x, y 16-bit, out 32-bit)
3112-// out(1) = c1[even]*x + c1[odd]*y
3113-#define dct_rot(out0, out1, x, y, c0, c1)                                      \
3114-	__m128i c0##lo = _mm_unpacklo_epi16((x), (y));                             \
3115-	__m128i c0##hi = _mm_unpackhi_epi16((x), (y));                             \
3116-	__m128i out0##_l = _mm_madd_epi16(c0##lo, c0);                             \
3117-	__m128i out0##_h = _mm_madd_epi16(c0##hi, c0);                             \
3118-	__m128i out1##_l = _mm_madd_epi16(c0##lo, c1);                             \
3119-	__m128i out1##_h = _mm_madd_epi16(c0##hi, c1)
3120-
3121-// out = in << 12  (in 16-bit, out 32-bit)
3122-#define dct_widen(out, in)                                                     \
3123-	__m128i out##_l =                                                          \
3124-	    _mm_srai_epi32(_mm_unpacklo_epi16(_mm_setzero_si128(), (in)), 4);      \
3125-	__m128i out##_h =                                                          \
3126-	    _mm_srai_epi32(_mm_unpackhi_epi16(_mm_setzero_si128(), (in)), 4)
3127-
3128-// wide add
3129-#define dct_wadd(out, a, b)                                                    \
3130-	__m128i out##_l = _mm_add_epi32(a##_l, b##_l);                             \
3131-	__m128i out##_h = _mm_add_epi32(a##_h, b##_h)
3132-
3133-// wide sub
3134-#define dct_wsub(out, a, b)                                                    \
3135-	__m128i out##_l = _mm_sub_epi32(a##_l, b##_l);                             \
3136-	__m128i out##_h = _mm_sub_epi32(a##_h, b##_h)
3137-
3138-// butterfly a/b, add bias, then shift by "s" and pack
3139-#define dct_bfly32o(out0, out1, a, b, bias, s)                                 \
3140-	{                                                                          \
3141-		__m128i abiased_l = _mm_add_epi32(a##_l, bias);                        \
3142-		__m128i abiased_h = _mm_add_epi32(a##_h, bias);                        \
3143-		dct_wadd(sum, abiased, b);                                             \
3144-		dct_wsub(dif, abiased, b);                                             \
3145-		out0 = _mm_packs_epi32(_mm_srai_epi32(sum_l, s),                       \
3146-		                       _mm_srai_epi32(sum_h, s));                      \
3147-		out1 = _mm_packs_epi32(_mm_srai_epi32(dif_l, s),                       \
3148-		                       _mm_srai_epi32(dif_h, s));                      \
3149-	}
3150-
3151-// 8-bit interleave step (for transposes)
3152-#define dct_interleave8(a, b)                                                  \
3153-	tmp = a;                                                                   \
3154-	a = _mm_unpacklo_epi8(a, b);                                               \
3155-	b = _mm_unpackhi_epi8(tmp, b)
3156-
3157-// 16-bit interleave step (for transposes)
3158-#define dct_interleave16(a, b)                                                 \
3159-	tmp = a;                                                                   \
3160-	a = _mm_unpacklo_epi16(a, b);                                              \
3161-	b = _mm_unpackhi_epi16(tmp, b)
3162-
3163-#define dct_pass(bias, shift)                                                  \
3164-	{                                                                          \
3165-		/* even part */                                                        \
3166-		dct_rot(t2e, t3e, row2, row6, rot0_0, rot0_1);                         \
3167-		__m128i sum04 = _mm_add_epi16(row0, row4);                             \
3168-		__m128i dif04 = _mm_sub_epi16(row0, row4);                             \
3169-		dct_widen(t0e, sum04);                                                 \
3170-		dct_widen(t1e, dif04);                                                 \
3171-		dct_wadd(x0, t0e, t3e);                                                \
3172-		dct_wsub(x3, t0e, t3e);                                                \
3173-		dct_wadd(x1, t1e, t2e);                                                \
3174-		dct_wsub(x2, t1e, t2e);                                                \
3175-		/* odd part */                                                         \
3176-		dct_rot(y0o, y2o, row7, row3, rot2_0, rot2_1);                         \
3177-		dct_rot(y1o, y3o, row5, row1, rot3_0, rot3_1);                         \
3178-		__m128i sum17 = _mm_add_epi16(row1, row7);                             \
3179-		__m128i sum35 = _mm_add_epi16(row3, row5);                             \
3180-		dct_rot(y4o, y5o, sum17, sum35, rot1_0, rot1_1);                       \
3181-		dct_wadd(x4, y0o, y4o);                                                \
3182-		dct_wadd(x5, y1o, y5o);                                                \
3183-		dct_wadd(x6, y2o, y5o);                                                \
3184-		dct_wadd(x7, y3o, y4o);                                                \
3185-		dct_bfly32o(row0, row7, x0, x7, bias, shift);                          \
3186-		dct_bfly32o(row1, row6, x1, x6, bias, shift);                          \
3187-		dct_bfly32o(row2, row5, x2, x5, bias, shift);                          \
3188-		dct_bfly32o(row3, row4, x3, x4, bias, shift);                          \
3189-	}
3190-
3191-	__m128i rot0_0 =
3192-	    dct_const(stbi__f2f(0.5411961f),
3193-	              stbi__f2f(0.5411961f) + stbi__f2f(-1.847759065f));
3194-	__m128i rot0_1 = dct_const(stbi__f2f(0.5411961f) + stbi__f2f(0.765366865f),
3195-	                           stbi__f2f(0.5411961f));
3196-	__m128i rot1_0 =
3197-	    dct_const(stbi__f2f(1.175875602f) + stbi__f2f(-0.899976223f),
3198-	              stbi__f2f(1.175875602f));
3199-	__m128i rot1_1 =
3200-	    dct_const(stbi__f2f(1.175875602f),
3201-	              stbi__f2f(1.175875602f) + stbi__f2f(-2.562915447f));
3202-	__m128i rot2_0 =
3203-	    dct_const(stbi__f2f(-1.961570560f) + stbi__f2f(0.298631336f),
3204-	              stbi__f2f(-1.961570560f));
3205-	__m128i rot2_1 =
3206-	    dct_const(stbi__f2f(-1.961570560f),
3207-	              stbi__f2f(-1.961570560f) + stbi__f2f(3.072711026f));
3208-	__m128i rot3_0 =
3209-	    dct_const(stbi__f2f(-0.390180644f) + stbi__f2f(2.053119869f),
3210-	              stbi__f2f(-0.390180644f));
3211-	__m128i rot3_1 =
3212-	    dct_const(stbi__f2f(-0.390180644f),
3213-	              stbi__f2f(-0.390180644f) + stbi__f2f(1.501321110f));
3214-
3215-	// rounding biases in column/row passes, see stbi__idct_block for
3216-	// explanation.
3217-	__m128i bias_0 = _mm_set1_epi32(512);
3218-	__m128i bias_1 = _mm_set1_epi32(65536 + (128 << 17));
3219-
3220-	// load
3221-	row0 = _mm_load_si128((const __m128i *)(data + 0 * 8));
3222-	row1 = _mm_load_si128((const __m128i *)(data + 1 * 8));
3223-	row2 = _mm_load_si128((const __m128i *)(data + 2 * 8));
3224-	row3 = _mm_load_si128((const __m128i *)(data + 3 * 8));
3225-	row4 = _mm_load_si128((const __m128i *)(data + 4 * 8));
3226-	row5 = _mm_load_si128((const __m128i *)(data + 5 * 8));
3227-	row6 = _mm_load_si128((const __m128i *)(data + 6 * 8));
3228-	row7 = _mm_load_si128((const __m128i *)(data + 7 * 8));
3229-
3230-	// column pass
3231-	dct_pass(bias_0, 10);
3232-
3233-	{
3234-		// 16bit 8x8 transpose pass 1
3235-		dct_interleave16(row0, row4);
3236-		dct_interleave16(row1, row5);
3237-		dct_interleave16(row2, row6);
3238-		dct_interleave16(row3, row7);
3239-
3240-		// transpose pass 2
3241-		dct_interleave16(row0, row2);
3242-		dct_interleave16(row1, row3);
3243-		dct_interleave16(row4, row6);
3244-		dct_interleave16(row5, row7);
3245-
3246-		// transpose pass 3
3247-		dct_interleave16(row0, row1);
3248-		dct_interleave16(row2, row3);
3249-		dct_interleave16(row4, row5);
3250-		dct_interleave16(row6, row7);
3251-	}
3252-
3253-	// row pass
3254-	dct_pass(bias_1, 17);
3255-
3256-	{
3257-		// pack
3258-		__m128i p0 = _mm_packus_epi16(row0, row1); // a0a1a2a3...a7b0b1b2b3...b7
3259-		__m128i p1 = _mm_packus_epi16(row2, row3);
3260-		__m128i p2 = _mm_packus_epi16(row4, row5);
3261-		__m128i p3 = _mm_packus_epi16(row6, row7);
3262-
3263-		// 8bit 8x8 transpose pass 1
3264-		dct_interleave8(p0, p2); // a0e0a1e1...
3265-		dct_interleave8(p1, p3); // c0g0c1g1...
3266-
3267-		// transpose pass 2
3268-		dct_interleave8(p0, p1); // a0c0e0g0...
3269-		dct_interleave8(p2, p3); // b0d0f0h0...
3270-
3271-		// transpose pass 3
3272-		dct_interleave8(p0, p2); // a0b0c0d0...
3273-		dct_interleave8(p1, p3); // a4b4c4d4...
3274-
3275-		// store
3276-		_mm_storel_epi64((__m128i *)out, p0);
3277-		out += out_stride;
3278-		_mm_storel_epi64((__m128i *)out, _mm_shuffle_epi32(p0, 0x4e));
3279-		out += out_stride;
3280-		_mm_storel_epi64((__m128i *)out, p2);
3281-		out += out_stride;
3282-		_mm_storel_epi64((__m128i *)out, _mm_shuffle_epi32(p2, 0x4e));
3283-		out += out_stride;
3284-		_mm_storel_epi64((__m128i *)out, p1);
3285-		out += out_stride;
3286-		_mm_storel_epi64((__m128i *)out, _mm_shuffle_epi32(p1, 0x4e));
3287-		out += out_stride;
3288-		_mm_storel_epi64((__m128i *)out, p3);
3289-		out += out_stride;
3290-		_mm_storel_epi64((__m128i *)out, _mm_shuffle_epi32(p3, 0x4e));
3291-	}
3292-
3293-#undef dct_const
3294-#undef dct_rot
3295-#undef dct_widen
3296-#undef dct_wadd
3297-#undef dct_wsub
3298-#undef dct_bfly32o
3299-#undef dct_interleave8
3300-#undef dct_interleave16
3301-#undef dct_pass
3302-}
3303-
3304-#endif // STBI_SSE2
3305-
3306-#ifdef STBI_NEON
3307-
3308-// NEON integer IDCT. should produce bit-identical
3309-// results to the generic C version.
3310-static void
3311-stbi__idct_simd(stbi_uc *out, int out_stride, short data[64])
3312-{
3313-	int16x8_t row0, row1, row2, row3, row4, row5, row6, row7;
3314-
3315-	int16x4_t rot0_0 = vdup_n_s16(stbi__f2f(0.5411961f));
3316-	int16x4_t rot0_1 = vdup_n_s16(stbi__f2f(-1.847759065f));
3317-	int16x4_t rot0_2 = vdup_n_s16(stbi__f2f(0.765366865f));
3318-	int16x4_t rot1_0 = vdup_n_s16(stbi__f2f(1.175875602f));
3319-	int16x4_t rot1_1 = vdup_n_s16(stbi__f2f(-0.899976223f));
3320-	int16x4_t rot1_2 = vdup_n_s16(stbi__f2f(-2.562915447f));
3321-	int16x4_t rot2_0 = vdup_n_s16(stbi__f2f(-1.961570560f));
3322-	int16x4_t rot2_1 = vdup_n_s16(stbi__f2f(-0.390180644f));
3323-	int16x4_t rot3_0 = vdup_n_s16(stbi__f2f(0.298631336f));
3324-	int16x4_t rot3_1 = vdup_n_s16(stbi__f2f(2.053119869f));
3325-	int16x4_t rot3_2 = vdup_n_s16(stbi__f2f(3.072711026f));
3326-	int16x4_t rot3_3 = vdup_n_s16(stbi__f2f(1.501321110f));
3327-
3328-#define dct_long_mul(out, inq, coeff)                                          \
3329-	int32x4_t out##_l = vmull_s16(vget_low_s16(inq), coeff);                   \
3330-	int32x4_t out##_h = vmull_s16(vget_high_s16(inq), coeff)
3331-
3332-#define dct_long_mac(out, acc, inq, coeff)                                     \
3333-	int32x4_t out##_l = vmlal_s16(acc##_l, vget_low_s16(inq), coeff);          \
3334-	int32x4_t out##_h = vmlal_s16(acc##_h, vget_high_s16(inq), coeff)
3335-
3336-#define dct_widen(out, inq)                                                    \
3337-	int32x4_t out##_l = vshll_n_s16(vget_low_s16(inq), 12);                    \
3338-	int32x4_t out##_h = vshll_n_s16(vget_high_s16(inq), 12)
3339-
3340-// wide add
3341-#define dct_wadd(out, a, b)                                                    \
3342-	int32x4_t out##_l = vaddq_s32(a##_l, b##_l);                               \
3343-	int32x4_t out##_h = vaddq_s32(a##_h, b##_h)
3344-
3345-// wide sub
3346-#define dct_wsub(out, a, b)                                                    \
3347-	int32x4_t out##_l = vsubq_s32(a##_l, b##_l);                               \
3348-	int32x4_t out##_h = vsubq_s32(a##_h, b##_h)
3349-
3350-// butterfly a/b, then shift using "shiftop" by "s" and pack
3351-#define dct_bfly32o(out0, out1, a, b, shiftop, s)                              \
3352-	{                                                                          \
3353-		dct_wadd(sum, a, b);                                                   \
3354-		dct_wsub(dif, a, b);                                                   \
3355-		out0 = vcombine_s16(shiftop(sum_l, s), shiftop(sum_h, s));             \
3356-		out1 = vcombine_s16(shiftop(dif_l, s), shiftop(dif_h, s));             \
3357-	}
3358-
3359-#define dct_pass(shiftop, shift)                                               \
3360-	{                                                                          \
3361-		/* even part */                                                        \
3362-		int16x8_t sum26 = vaddq_s16(row2, row6);                               \
3363-		dct_long_mul(p1e, sum26, rot0_0);                                      \
3364-		dct_long_mac(t2e, p1e, row6, rot0_1);                                  \
3365-		dct_long_mac(t3e, p1e, row2, rot0_2);                                  \
3366-		int16x8_t sum04 = vaddq_s16(row0, row4);                               \
3367-		int16x8_t dif04 = vsubq_s16(row0, row4);                               \
3368-		dct_widen(t0e, sum04);                                                 \
3369-		dct_widen(t1e, dif04);                                                 \
3370-		dct_wadd(x0, t0e, t3e);                                                \
3371-		dct_wsub(x3, t0e, t3e);                                                \
3372-		dct_wadd(x1, t1e, t2e);                                                \
3373-		dct_wsub(x2, t1e, t2e);                                                \
3374-		/* odd part */                                                         \
3375-		int16x8_t sum15 = vaddq_s16(row1, row5);                               \
3376-		int16x8_t sum17 = vaddq_s16(row1, row7);                               \
3377-		int16x8_t sum35 = vaddq_s16(row3, row5);                               \
3378-		int16x8_t sum37 = vaddq_s16(row3, row7);                               \
3379-		int16x8_t sumodd = vaddq_s16(sum17, sum35);                            \
3380-		dct_long_mul(p5o, sumodd, rot1_0);                                     \
3381-		dct_long_mac(p1o, p5o, sum17, rot1_1);                                 \
3382-		dct_long_mac(p2o, p5o, sum35, rot1_2);                                 \
3383-		dct_long_mul(p3o, sum37, rot2_0);                                      \
3384-		dct_long_mul(p4o, sum15, rot2_1);                                      \
3385-		dct_wadd(sump13o, p1o, p3o);                                           \
3386-		dct_wadd(sump24o, p2o, p4o);                                           \
3387-		dct_wadd(sump23o, p2o, p3o);                                           \
3388-		dct_wadd(sump14o, p1o, p4o);                                           \
3389-		dct_long_mac(x4, sump13o, row7, rot3_0);                               \
3390-		dct_long_mac(x5, sump24o, row5, rot3_1);                               \
3391-		dct_long_mac(x6, sump23o, row3, rot3_2);                               \
3392-		dct_long_mac(x7, sump14o, row1, rot3_3);                               \
3393-		dct_bfly32o(row0, row7, x0, x7, shiftop, shift);                       \
3394-		dct_bfly32o(row1, row6, x1, x6, shiftop, shift);                       \
3395-		dct_bfly32o(row2, row5, x2, x5, shiftop, shift);                       \
3396-		dct_bfly32o(row3, row4, x3, x4, shiftop, shift);                       \
3397-	}
3398-
3399-	// load
3400-	row0 = vld1q_s16(data + 0 * 8);
3401-	row1 = vld1q_s16(data + 1 * 8);
3402-	row2 = vld1q_s16(data + 2 * 8);
3403-	row3 = vld1q_s16(data + 3 * 8);
3404-	row4 = vld1q_s16(data + 4 * 8);
3405-	row5 = vld1q_s16(data + 5 * 8);
3406-	row6 = vld1q_s16(data + 6 * 8);
3407-	row7 = vld1q_s16(data + 7 * 8);
3408-
3409-	// add DC bias
3410-	row0 = vaddq_s16(row0, vsetq_lane_s16(1024, vdupq_n_s16(0), 0));
3411-
3412-	// column pass
3413-	dct_pass(vrshrn_n_s32, 10);
3414-
3415-	// 16bit 8x8 transpose
3416-	{
3417-// these three map to a single VTRN.16, VTRN.32, and VSWP, respectively.
3418-// whether compilers actually get this is another story, sadly.
3419-#define dct_trn16(x, y)                                                        \
3420-	{                                                                          \
3421-		int16x8x2_t t = vtrnq_s16(x, y);                                       \
3422-		x = t.val[0];                                                          \
3423-		y = t.val[1];                                                          \
3424-	}
3425-#define dct_trn32(x, y)                                                        \
3426-	{                                                                          \
3427-		int32x4x2_t t =                                                        \
3428-		    vtrnq_s32(vreinterpretq_s32_s16(x), vreinterpretq_s32_s16(y));     \
3429-		x = vreinterpretq_s16_s32(t.val[0]);                                   \
3430-		y = vreinterpretq_s16_s32(t.val[1]);                                   \
3431-	}
3432-#define dct_trn64(x, y)                                                        \
3433-	{                                                                          \
3434-		int16x8_t x0 = x;                                                      \
3435-		int16x8_t y0 = y;                                                      \
3436-		x = vcombine_s16(vget_low_s16(x0), vget_low_s16(y0));                  \
3437-		y = vcombine_s16(vget_high_s16(x0), vget_high_s16(y0));                \
3438-	}
3439-
3440-		// pass 1
3441-		dct_trn16(row0, row1); // a0b0a2b2a4b4a6b6
3442-		dct_trn16(row2, row3);
3443-		dct_trn16(row4, row5);
3444-		dct_trn16(row6, row7);
3445-
3446-		// pass 2
3447-		dct_trn32(row0, row2); // a0b0c0d0a4b4c4d4
3448-		dct_trn32(row1, row3);
3449-		dct_trn32(row4, row6);
3450-		dct_trn32(row5, row7);
3451-
3452-		// pass 3
3453-		dct_trn64(row0, row4); // a0b0c0d0e0f0g0h0
3454-		dct_trn64(row1, row5);
3455-		dct_trn64(row2, row6);
3456-		dct_trn64(row3, row7);
3457-
3458-#undef dct_trn16
3459-#undef dct_trn32
3460-#undef dct_trn64
3461-	}
3462-
3463-	// row pass
3464-	// vrshrn_n_s32 only supports shifts up to 16, we need
3465-	// 17. so do a non-rounding shift of 16 first then follow
3466-	// up with a rounding shift by 1.
3467-	dct_pass(vshrn_n_s32, 16);
3468-
3469-	{
3470-		// pack and round
3471-		uint8x8_t p0 = vqrshrun_n_s16(row0, 1);
3472-		uint8x8_t p1 = vqrshrun_n_s16(row1, 1);
3473-		uint8x8_t p2 = vqrshrun_n_s16(row2, 1);
3474-		uint8x8_t p3 = vqrshrun_n_s16(row3, 1);
3475-		uint8x8_t p4 = vqrshrun_n_s16(row4, 1);
3476-		uint8x8_t p5 = vqrshrun_n_s16(row5, 1);
3477-		uint8x8_t p6 = vqrshrun_n_s16(row6, 1);
3478-		uint8x8_t p7 = vqrshrun_n_s16(row7, 1);
3479-
3480-		// again, these can translate into one instruction, but often don't.
3481-#define dct_trn8_8(x, y)                                                       \
3482-	{                                                                          \
3483-		uint8x8x2_t t = vtrn_u8(x, y);                                         \
3484-		x = t.val[0];                                                          \
3485-		y = t.val[1];                                                          \
3486-	}
3487-#define dct_trn8_16(x, y)                                                      \
3488-	{                                                                          \
3489-		uint16x4x2_t t =                                                       \
3490-		    vtrn_u16(vreinterpret_u16_u8(x), vreinterpret_u16_u8(y));          \
3491-		x = vreinterpret_u8_u16(t.val[0]);                                     \
3492-		y = vreinterpret_u8_u16(t.val[1]);                                     \
3493-	}
3494-#define dct_trn8_32(x, y)                                                      \
3495-	{                                                                          \
3496-		uint32x2x2_t t =                                                       \
3497-		    vtrn_u32(vreinterpret_u32_u8(x), vreinterpret_u32_u8(y));          \
3498-		x = vreinterpret_u8_u32(t.val[0]);                                     \
3499-		y = vreinterpret_u8_u32(t.val[1]);                                     \
3500-	}
3501-
3502-		// sadly can't use interleaved stores here since we only write
3503-		// 8 bytes to each scan line!
3504-
3505-		// 8x8 8-bit transpose pass 1
3506-		dct_trn8_8(p0, p1);
3507-		dct_trn8_8(p2, p3);
3508-		dct_trn8_8(p4, p5);
3509-		dct_trn8_8(p6, p7);
3510-
3511-		// pass 2
3512-		dct_trn8_16(p0, p2);
3513-		dct_trn8_16(p1, p3);
3514-		dct_trn8_16(p4, p6);
3515-		dct_trn8_16(p5, p7);
3516-
3517-		// pass 3
3518-		dct_trn8_32(p0, p4);
3519-		dct_trn8_32(p1, p5);
3520-		dct_trn8_32(p2, p6);
3521-		dct_trn8_32(p3, p7);
3522-
3523-		// store
3524-		vst1_u8(out, p0);
3525-		out += out_stride;
3526-		vst1_u8(out, p1);
3527-		out += out_stride;
3528-		vst1_u8(out, p2);
3529-		out += out_stride;
3530-		vst1_u8(out, p3);
3531-		out += out_stride;
3532-		vst1_u8(out, p4);
3533-		out += out_stride;
3534-		vst1_u8(out, p5);
3535-		out += out_stride;
3536-		vst1_u8(out, p6);
3537-		out += out_stride;
3538-		vst1_u8(out, p7);
3539-
3540-#undef dct_trn8_8
3541-#undef dct_trn8_16
3542-#undef dct_trn8_32
3543-	}
3544-
3545-#undef dct_long_mul
3546-#undef dct_long_mac
3547-#undef dct_widen
3548-#undef dct_wadd
3549-#undef dct_wsub
3550-#undef dct_bfly32o
3551-#undef dct_pass
3552-}
3553-
3554-#endif // STBI_NEON
3555-
3556-#define STBI__MARKER_none 0xff
3557-// if there's a pending marker from the entropy stream, return that
3558-// otherwise, fetch from the stream and get a marker. if there's no
3559-// marker, return 0xff, which is never a valid marker value
3560-static stbi_uc
3561-stbi__get_marker(stbi__jpeg *j)
3562-{
3563-	stbi_uc x;
3564-	if (j->marker != STBI__MARKER_none) {
3565-		x = j->marker;
3566-		j->marker = STBI__MARKER_none;
3567-		return x;
3568-	}
3569-	x = stbi__get8(j->s);
3570-	if (x != 0xff) {
3571-		return STBI__MARKER_none;
3572-	}
3573-	while (x == 0xff) {
3574-		x = stbi__get8(j->s); // consume repeated 0xff fill bytes
3575-	}
3576-	return x;
3577-}
3578-
3579-// in each scan, we'll have scan_n components, and the order
3580-// of the components is specified by order[]
3581-#define STBI__RESTART(x) ((x) >= 0xd0 && (x) <= 0xd7)
3582-
3583-// after a restart interval, reset the entropy decoder and
3584-// the DC prediction
3585-static void
3586-stbi__jpeg_reset(stbi__jpeg *j)
3587-{
3588-	j->code_bits = 0;
3589-	j->code_buffer = 0;
3590-	j->nomore = 0;
3591-	j->img_comp[0].dc_pred = j->img_comp[1].dc_pred = j->img_comp[2].dc_pred =
3592-	    j->img_comp[3].dc_pred = 0;
3593-	j->marker = STBI__MARKER_none;
3594-	j->todo = j->restart_interval ? j->restart_interval : 0x7fffffff;
3595-	j->eob_run = 0;
3596-	// no more than 1<<31 MCUs if no restart_interval? that's plenty safe,
3597-	// since we don't even allow 1<<30 pixels
3598-}
3599-
3600-static int
3601-stbi__parse_entropy_coded_data(stbi__jpeg *z)
3602-{
3603-	stbi__jpeg_reset(z);
3604-	if (!z->progressive) {
3605-		if (z->scan_n == 1) {
3606-			int i, j;
3607-			STBI_SIMD_ALIGN(short, data[64]);
3608-			int n = z->order[0];
3609-			// non-interleaved data: we just need to process one block at a
3610-			// time, in trivial scanline order. the number of blocks to do
3611-			// depends only on how many actual "pixels" this component has,
3612-			// independent of interleaved MCU blocking and such
3613-			int w = (z->img_comp[n].x + 7) >> 3;
3614-			int h = (z->img_comp[n].y + 7) >> 3;
3615-			for (j = 0; j < h; ++j) {
3616-				for (i = 0; i < w; ++i) {
3617-					int ha = z->img_comp[n].ha;
3618-					if (!stbi__jpeg_decode_block(
3619-					        z, data, z->huff_dc + z->img_comp[n].hd,
3620-					        z->huff_ac + ha, z->fast_ac[ha], n,
3621-					        z->dequant[z->img_comp[n].tq])) {
3622-						return 0;
3623-					}
3624-					z->idct_block_kernel(z->img_comp[n].data +
3625-					                         z->img_comp[n].w2 * j * 8 + i * 8,
3626-					                     z->img_comp[n].w2, data);
3627-					// every data block is an MCU, so count down the restart
3628-					// interval
3629-					if (--z->todo <= 0) {
3630-						if (z->code_bits < 24) {
3631-							stbi__grow_buffer_unsafe(z);
3632-						}
3633-						// if it's NOT a restart, then just bail, so we get
3634-						// corrupt data rather than no data
3635-						if (!STBI__RESTART(z->marker)) {
3636-							return 1;
3637-						}
3638-						stbi__jpeg_reset(z);
3639-					}
3640-				}
3641-			}
3642-			return 1;
3643-		} else { // interleaved
3644-			int i, j, k, x, y;
3645-			STBI_SIMD_ALIGN(short, data[64]);
3646-			for (j = 0; j < z->img_mcu_y; ++j) {
3647-				for (i = 0; i < z->img_mcu_x; ++i) {
3648-					// scan an interleaved mcu... process scan_n components in
3649-					// order
3650-					for (k = 0; k < z->scan_n; ++k) {
3651-						int n = z->order[k];
3652-						// scan out an mcu's worth of this component; that's
3653-						// just determined by the basic H and V specified for
3654-						// the component
3655-						for (y = 0; y < z->img_comp[n].v; ++y) {
3656-							for (x = 0; x < z->img_comp[n].h; ++x) {
3657-								int x2 = (i * z->img_comp[n].h + x) * 8;
3658-								int y2 = (j * z->img_comp[n].v + y) * 8;
3659-								int ha = z->img_comp[n].ha;
3660-								if (!stbi__jpeg_decode_block(
3661-								        z, data, z->huff_dc + z->img_comp[n].hd,
3662-								        z->huff_ac + ha, z->fast_ac[ha], n,
3663-								        z->dequant[z->img_comp[n].tq])) {
3664-									return 0;
3665-								}
3666-								z->idct_block_kernel(
3667-								    z->img_comp[n].data +
3668-								        z->img_comp[n].w2 * y2 + x2,
3669-								    z->img_comp[n].w2, data);
3670-							}
3671-						}
3672-					}
3673-					// after all interleaved components, that's an interleaved
3674-					// MCU, so now count down the restart interval
3675-					if (--z->todo <= 0) {
3676-						if (z->code_bits < 24) {
3677-							stbi__grow_buffer_unsafe(z);
3678-						}
3679-						if (!STBI__RESTART(z->marker)) {
3680-							return 1;
3681-						}
3682-						stbi__jpeg_reset(z);
3683-					}
3684-				}
3685-			}
3686-			return 1;
3687-		}
3688-	} else {
3689-		if (z->scan_n == 1) {
3690-			int i, j;
3691-			int n = z->order[0];
3692-			// non-interleaved data: we just need to process one block at a
3693-			// time, in trivial scanline order. the number of blocks to do
3694-			// depends only on how many actual "pixels" this component has,
3695-			// independent of interleaved MCU blocking and such
3696-			int w = (z->img_comp[n].x + 7) >> 3;
3697-			int h = (z->img_comp[n].y + 7) >> 3;
3698-			for (j = 0; j < h; ++j) {
3699-				for (i = 0; i < w; ++i) {
3700-					short *data = z->img_comp[n].coeff +
3701-					              64 * (i + j * z->img_comp[n].coeff_w);
3702-					if (z->spec_start == 0) {
3703-						if (!stbi__jpeg_decode_block_prog_dc(
3704-						        z, data, &z->huff_dc[z->img_comp[n].hd], n)) {
3705-							return 0;
3706-						}
3707-					} else {
3708-						int ha = z->img_comp[n].ha;
3709-						if (!stbi__jpeg_decode_block_prog_ac(
3710-						        z, data, &z->huff_ac[ha], z->fast_ac[ha])) {
3711-							return 0;
3712-						}
3713-					}
3714-					// every data block is an MCU, so count down the restart
3715-					// interval
3716-					if (--z->todo <= 0) {
3717-						if (z->code_bits < 24) {
3718-							stbi__grow_buffer_unsafe(z);
3719-						}
3720-						if (!STBI__RESTART(z->marker)) {
3721-							return 1;
3722-						}
3723-						stbi__jpeg_reset(z);
3724-					}
3725-				}
3726-			}
3727-			return 1;
3728-		} else { // interleaved
3729-			int i, j, k, x, y;
3730-			for (j = 0; j < z->img_mcu_y; ++j) {
3731-				for (i = 0; i < z->img_mcu_x; ++i) {
3732-					// scan an interleaved mcu... process scan_n components in
3733-					// order
3734-					for (k = 0; k < z->scan_n; ++k) {
3735-						int n = z->order[k];
3736-						// scan out an mcu's worth of this component; that's
3737-						// just determined by the basic H and V specified for
3738-						// the component
3739-						for (y = 0; y < z->img_comp[n].v; ++y) {
3740-							for (x = 0; x < z->img_comp[n].h; ++x) {
3741-								int x2 = (i * z->img_comp[n].h + x);
3742-								int y2 = (j * z->img_comp[n].v + y);
3743-								short *data =
3744-								    z->img_comp[n].coeff +
3745-								    64 * (x2 + y2 * z->img_comp[n].coeff_w);
3746-								if (!stbi__jpeg_decode_block_prog_dc(
3747-								        z, data, &z->huff_dc[z->img_comp[n].hd],
3748-								        n)) {
3749-									return 0;
3750-								}
3751-							}
3752-						}
3753-					}
3754-					// after all interleaved components, that's an interleaved
3755-					// MCU, so now count down the restart interval
3756-					if (--z->todo <= 0) {
3757-						if (z->code_bits < 24) {
3758-							stbi__grow_buffer_unsafe(z);
3759-						}
3760-						if (!STBI__RESTART(z->marker)) {
3761-							return 1;
3762-						}
3763-						stbi__jpeg_reset(z);
3764-					}
3765-				}
3766-			}
3767-			return 1;
3768-		}
3769-	}
3770-}
3771-
3772-static void
3773-stbi__jpeg_dequantize(short *data, stbi__uint16 *dequant)
3774-{
3775-	int i;
3776-	for (i = 0; i < 64; ++i) {
3777-		data[i] *= dequant[i];
3778-	}
3779-}
3780-
3781-static void
3782-stbi__jpeg_finish(stbi__jpeg *z)
3783-{
3784-	if (z->progressive) {
3785-		// dequantize and idct the data
3786-		int i, j, n;
3787-		for (n = 0; n < z->s->img_n; ++n) {
3788-			int w = (z->img_comp[n].x + 7) >> 3;
3789-			int h = (z->img_comp[n].y + 7) >> 3;
3790-			for (j = 0; j < h; ++j) {
3791-				for (i = 0; i < w; ++i) {
3792-					short *data = z->img_comp[n].coeff +
3793-					              64 * (i + j * z->img_comp[n].coeff_w);
3794-					stbi__jpeg_dequantize(data, z->dequant[z->img_comp[n].tq]);
3795-					z->idct_block_kernel(z->img_comp[n].data +
3796-					                         z->img_comp[n].w2 * j * 8 + i * 8,
3797-					                     z->img_comp[n].w2, data);
3798-				}
3799-			}
3800-		}
3801-	}
3802-}
3803-
3804-static int
3805-stbi__process_marker(stbi__jpeg *z, int m)
3806-{
3807-	int L;
3808-	switch (m) {
3809-	case STBI__MARKER_none: // no marker found
3810-		return stbi__err("expected marker", "Corrupt JPEG");
3811-
3812-	case 0xDD: // DRI - specify restart interval
3813-		if (stbi__get16be(z->s) != 4) {
3814-			return stbi__err("bad DRI len", "Corrupt JPEG");
3815-		}
3816-		z->restart_interval = stbi__get16be(z->s);
3817-		return 1;
3818-
3819-	case 0xDB: // DQT - define quantization table
3820-		L = stbi__get16be(z->s) - 2;
3821-		while (L > 0) {
3822-			int q = stbi__get8(z->s);
3823-			int p = q >> 4, sixteen = (p != 0);
3824-			int t = q & 15, i;
3825-			if (p != 0 && p != 1) {
3826-				return stbi__err("bad DQT type", "Corrupt JPEG");
3827-			}
3828-			if (t > 3) {
3829-				return stbi__err("bad DQT table", "Corrupt JPEG");
3830-			}
3831-
3832-			for (i = 0; i < 64; ++i) {
3833-				z->dequant[t][stbi__jpeg_dezigzag[i]] =
3834-				    (stbi__uint16)(sixteen ? stbi__get16be(z->s)
3835-				                           : stbi__get8(z->s));
3836-			}
3837-			L -= (sixteen ? 129 : 65);
3838-		}
3839-		return L == 0;
3840-
3841-	case 0xC4: // DHT - define huffman table
3842-		L = stbi__get16be(z->s) - 2;
3843-		while (L > 0) {
3844-			stbi_uc *v;
3845-			int sizes[16], i, n = 0;
3846-			int q = stbi__get8(z->s);
3847-			int tc = q >> 4;
3848-			int th = q & 15;
3849-			if (tc > 1 || th > 3) {
3850-				return stbi__err("bad DHT header", "Corrupt JPEG");
3851-			}
3852-			for (i = 0; i < 16; ++i) {
3853-				sizes[i] = stbi__get8(z->s);
3854-				n += sizes[i];
3855-			}
3856-			if (n > 256) {
3857-				// the loop over i < n below would write past the end of
3858-				// the values array
3859-				return stbi__err("bad DHT header", "Corrupt JPEG");
3860-			}
3861-			L -= 17;
3862-			if (tc == 0) {
3863-				if (!stbi__build_huffman(z->huff_dc + th, sizes)) {
3864-					return 0;
3865-				}
3866-				v = z->huff_dc[th].values;
3867-			} else {
3868-				if (!stbi__build_huffman(z->huff_ac + th, sizes)) {
3869-					return 0;
3870-				}
3871-				v = z->huff_ac[th].values;
3872-			}
3873-			for (i = 0; i < n; ++i) {
3874-				v[i] = stbi__get8(z->s);
3875-			}
3876-			if (tc != 0) {
3877-				stbi__build_fast_ac(z->fast_ac[th], z->huff_ac + th);
3878-			}
3879-			L -= n;
3880-		}
3881-		return L == 0;
3882-	}
3883-
3884-	// check for comment block or APP blocks
3885-	if ((m >= 0xE0 && m <= 0xEF) || m == 0xFE) {
3886-		L = stbi__get16be(z->s);
3887-		if (L < 2) {
3888-			if (m == 0xFE) {
3889-				return stbi__err("bad COM len", "Corrupt JPEG");
3890-			} else {
3891-				return stbi__err("bad APP len", "Corrupt JPEG");
3892-			}
3893-		}
3894-		L -= 2;
3895-
3896-		if (m == 0xE0 && L >= 5) { // JFIF APP0 segment
3897-			static const unsigned char tag[5] = {'J', 'F', 'I', 'F', '\0'};
3898-			int ok = 1;
3899-			int i;
3900-			for (i = 0; i < 5; ++i) {
3901-				if (stbi__get8(z->s) != tag[i]) {
3902-					ok = 0;
3903-				}
3904-			}
3905-			L -= 5;
3906-			if (ok) {
3907-				z->jfif = 1;
3908-			}
3909-		} else if (m == 0xEE && L >= 12) { // Adobe APP14 segment
3910-			static const unsigned char tag[6] = {'A', 'd', 'o', 'b', 'e', '\0'};
3911-			int ok = 1;
3912-			int i;
3913-			for (i = 0; i < 6; ++i) {
3914-				if (stbi__get8(z->s) != tag[i]) {
3915-					ok = 0;
3916-				}
3917-			}
3918-			L -= 6;
3919-			if (ok) {
3920-				stbi__get8(z->s);                            // version
3921-				stbi__get16be(z->s);                         // flags0
3922-				stbi__get16be(z->s);                         // flags1
3923-				z->app14_color_transform = stbi__get8(z->s); // color transform
3924-				L -= 6;
3925-			}
3926-		}
3927-
3928-		stbi__skip(z->s, L);
3929-		return 1;
3930-	}
3931-
3932-	return stbi__err("unknown marker", "Corrupt JPEG");
3933-}
3934-
3935-// after we see SOS
3936-static int
3937-stbi__process_scan_header(stbi__jpeg *z)
3938-{
3939-	int i;
3940-	int Ls = stbi__get16be(z->s);
3941-	z->scan_n = stbi__get8(z->s);
3942-	if (z->scan_n < 1 || z->scan_n > 4 || z->scan_n > (int)z->s->img_n) {
3943-		return stbi__err("bad SOS component count", "Corrupt JPEG");
3944-	}
3945-	if (Ls != 6 + 2 * z->scan_n) {
3946-		return stbi__err("bad SOS len", "Corrupt JPEG");
3947-	}
3948-	for (i = 0; i < z->scan_n; ++i) {
3949-		int id = stbi__get8(z->s), which;
3950-		int q = stbi__get8(z->s);
3951-		for (which = 0; which < z->s->img_n; ++which) {
3952-			if (z->img_comp[which].id == id) {
3953-				break;
3954-			}
3955-		}
3956-		if (which == z->s->img_n) {
3957-			return 0; // no match
3958-		}
3959-		z->img_comp[which].hd = q >> 4;
3960-		if (z->img_comp[which].hd > 3) {
3961-			return stbi__err("bad DC huff", "Corrupt JPEG");
3962-		}
3963-		z->img_comp[which].ha = q & 15;
3964-		if (z->img_comp[which].ha > 3) {
3965-			return stbi__err("bad AC huff", "Corrupt JPEG");
3966-		}
3967-		z->order[i] = which;
3968-	}
3969-
3970-	{
3971-		int aa;
3972-		z->spec_start = stbi__get8(z->s);
3973-		z->spec_end = stbi__get8(z->s); // should be 63, but might be 0
3974-		aa = stbi__get8(z->s);
3975-		z->succ_high = (aa >> 4);
3976-		z->succ_low = (aa & 15);
3977-		if (z->progressive) {
3978-			if (z->spec_start > 63 || z->spec_end > 63 ||
3979-			    z->spec_start > z->spec_end || z->succ_high > 13 ||
3980-			    z->succ_low > 13) {
3981-				return stbi__err("bad SOS", "Corrupt JPEG");
3982-			}
3983-		} else {
3984-			if (z->spec_start != 0) {
3985-				return stbi__err("bad SOS", "Corrupt JPEG");
3986-			}
3987-			if (z->succ_high != 0 || z->succ_low != 0) {
3988-				return stbi__err("bad SOS", "Corrupt JPEG");
3989-			}
3990-			z->spec_end = 63;
3991-		}
3992-	}
3993-
3994-	return 1;
3995-}
3996-
3997-static int
3998-stbi__free_jpeg_components(stbi__jpeg *z, int ncomp, int why)
3999-{
4000-	int i;
4001-	for (i = 0; i < ncomp; ++i) {
4002-		if (z->img_comp[i].raw_data) {
4003-			STBI_FREE(z->img_comp[i].raw_data);
4004-			z->img_comp[i].raw_data = NULL;
4005-			z->img_comp[i].data = NULL;
4006-		}
4007-		if (z->img_comp[i].raw_coeff) {
4008-			STBI_FREE(z->img_comp[i].raw_coeff);
4009-			z->img_comp[i].raw_coeff = 0;
4010-			z->img_comp[i].coeff = 0;
4011-		}
4012-		if (z->img_comp[i].linebuf) {
4013-			STBI_FREE(z->img_comp[i].linebuf);
4014-			z->img_comp[i].linebuf = NULL;
4015-		}
4016-	}
4017-	return why;
4018-}
4019-
4020-static int
4021-stbi__process_frame_header(stbi__jpeg *z, int scan)
4022-{
4023-	stbi__context *s = z->s;
4024-	int Lf, p, i, q, h_max = 1, v_max = 1, c;
4025-	Lf = stbi__get16be(s);
4026-	if (Lf < 11) {
4027-		return stbi__err("bad SOF len", "Corrupt JPEG"); // JPEG requires Lf >= 11 (8 + 3 * 1 component)
4028-	}
4029-	p = stbi__get8(s);
4030-	if (p != 8) {
4031-		return stbi__err(
4032-		    "only 8-bit",
4033-		    "JPEG format not supported: 8-bit only"); // JPEG baseline
4034-	}
4035-	s->img_y = stbi__get16be(s);
4036-	if (s->img_y == 0) {
4037-		return stbi__err(
4038-		    "no header height",
4039-		    "JPEG format not supported: delayed height"); // Legal, but we don't
4040-		                                                  // handle it--but
4041-		                                                  // neither does IJG
4042-	}
4043-	s->img_x = stbi__get16be(s);
4044-	if (s->img_x == 0) {
4045-		return stbi__err("0 width", "Corrupt JPEG"); // JPEG requires a nonzero width
4046-	}
4047-	if (s->img_y > STBI_MAX_DIMENSIONS) {
4048-		return stbi__err("too large", "Very large image (corrupt?)");
4049-	}
4050-	if (s->img_x > STBI_MAX_DIMENSIONS) {
4051-		return stbi__err("too large", "Very large image (corrupt?)");
4052-	}
4053-	c = stbi__get8(s);
4054-	if (c != 3 && c != 1 && c != 4) {
4055-		return stbi__err("bad component count", "Corrupt JPEG");
4056-	}
4057-	s->img_n = c;
4058-	for (i = 0; i < c; ++i) {
4059-		z->img_comp[i].data = NULL;
4060-		z->img_comp[i].linebuf = NULL;
4061-	}
4062-
4063-	if (Lf != 8 + 3 * s->img_n) {
4064-		return stbi__err("bad SOF len", "Corrupt JPEG");
4065-	}
4066-
4067-	z->rgb = 0;
4068-	for (i = 0; i < s->img_n; ++i) {
4069-		static const unsigned char rgb[3] = {'R', 'G', 'B'};
4070-		z->img_comp[i].id = stbi__get8(s);
4071-		if (s->img_n == 3 && z->img_comp[i].id == rgb[i]) {
4072-			++z->rgb;
4073-		}
4074-		q = stbi__get8(s);
4075-		z->img_comp[i].h = (q >> 4);
4076-		if (!z->img_comp[i].h || z->img_comp[i].h > 4) {
4077-			return stbi__err("bad H", "Corrupt JPEG");
4078-		}
4079-		z->img_comp[i].v = q & 15;
4080-		if (!z->img_comp[i].v || z->img_comp[i].v > 4) {
4081-			return stbi__err("bad V", "Corrupt JPEG");
4082-		}
4083-		z->img_comp[i].tq = stbi__get8(s);
4084-		if (z->img_comp[i].tq > 3) {
4085-			return stbi__err("bad TQ", "Corrupt JPEG");
4086-		}
4087-	}
4088-
4089-	if (scan != STBI__SCAN_load) {
4090-		return 1;
4091-	}
4092-
4093-	if (!stbi__mad3sizes_valid(s->img_x, s->img_y, s->img_n, 0)) {
4094-		return stbi__err("too large", "Image too large to decode");
4095-	}
4096-
4097-	for (i = 0; i < s->img_n; ++i) {
4098-		if (z->img_comp[i].h > h_max) {
4099-			h_max = z->img_comp[i].h;
4100-		}
4101-		if (z->img_comp[i].v > v_max) {
4102-			v_max = z->img_comp[i].v;
4103-		}
4104-	}
4105-
4106-	// check that plane subsampling factors are integer ratios; our resamplers
4107-	// can't deal with fractional ratios and I've never seen a non-corrupted
4108-	// JPEG file actually use them
4109-	for (i = 0; i < s->img_n; ++i) {
4110-		if (h_max % z->img_comp[i].h != 0) {
4111-			return stbi__err("bad H", "Corrupt JPEG");
4112-		}
4113-		if (v_max % z->img_comp[i].v != 0) {
4114-			return stbi__err("bad V", "Corrupt JPEG");
4115-		}
4116-	}
4117-
4118-	// compute interleaved mcu info
4119-	z->img_h_max = h_max;
4120-	z->img_v_max = v_max;
4121-	z->img_mcu_w = h_max * 8;
4122-	z->img_mcu_h = v_max * 8;
4123-	// these sizes can't be more than 17 bits
4124-	z->img_mcu_x = (s->img_x + z->img_mcu_w - 1) / z->img_mcu_w;
4125-	z->img_mcu_y = (s->img_y + z->img_mcu_h - 1) / z->img_mcu_h;
4126-
4127-	for (i = 0; i < s->img_n; ++i) {
4128-		// number of effective pixels (e.g. for non-interleaved MCU)
4129-		z->img_comp[i].x = (s->img_x * z->img_comp[i].h + h_max - 1) / h_max;
4130-		z->img_comp[i].y = (s->img_y * z->img_comp[i].v + v_max - 1) / v_max;
4131-		// to simplify generation, we'll allocate enough memory to decode
4132-		// the bogus oversized data from using interleaved MCUs and their
4133-		// big blocks (e.g. a 16x16 iMCU on an image of width 33); we won't
4134-		// discard the extra data until colorspace conversion
4135-		//
4136-		// img_mcu_x, img_mcu_y: <=17 bits; comp[i].h and .v are <=4 (checked
4137-		// earlier) so these muls can't overflow with 32-bit ints (which we
4138-		// require)
4139-		z->img_comp[i].w2 = z->img_mcu_x * z->img_comp[i].h * 8;
4140-		z->img_comp[i].h2 = z->img_mcu_y * z->img_comp[i].v * 8;
4141-		z->img_comp[i].coeff = 0;
4142-		z->img_comp[i].raw_coeff = 0;
4143-		z->img_comp[i].linebuf = NULL;
4144-		z->img_comp[i].raw_data =
4145-		    stbi__malloc_mad2(z->img_comp[i].w2, z->img_comp[i].h2, 15);
4146-		if (z->img_comp[i].raw_data == NULL) {
4147-			return stbi__free_jpeg_components(
4148-			    z, i + 1, stbi__err("outofmem", "Out of memory"));
4149-		}
4150-		// align blocks for idct using mmx/sse
4151-		z->img_comp[i].data =
4152-		    (stbi_uc *)(((size_t)z->img_comp[i].raw_data + 15) & ~15);
4153-		if (z->progressive) {
4154-			// w2, h2 are multiples of 8 (see above)
4155-			z->img_comp[i].coeff_w = z->img_comp[i].w2 / 8;
4156-			z->img_comp[i].coeff_h = z->img_comp[i].h2 / 8;
4157-			z->img_comp[i].raw_coeff = stbi__malloc_mad3(
4158-			    z->img_comp[i].w2, z->img_comp[i].h2, sizeof(short), 15);
4159-			if (z->img_comp[i].raw_coeff == NULL) {
4160-				return stbi__free_jpeg_components(
4161-				    z, i + 1, stbi__err("outofmem", "Out of memory"));
4162-			}
4163-			z->img_comp[i].coeff =
4164-			    (short *)(((size_t)z->img_comp[i].raw_coeff + 15) & ~15);
4165-		}
4166-	}
4167-
4168-	return 1;
4169-}
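The MCU and plane-size arithmetic in the frame header is easier to follow with concrete numbers. Below is a small sketch of just the horizontal part, using the "16x16 iMCU on an image of width 33" case from the comment above. `mcu_count`, `plane_geometry`, and `plane_width` are hypothetical names introduced for illustration; the formulas are copied from the removed code, but the packaging is an assumption.

```c
#include <assert.h>

/* Hypothetical container: effective sample count and padded row width. */
typedef struct {
	int x;  /* effective samples per row (like img_comp[i].x) */
	int w2; /* allocated row width, padded to whole MCUs (like img_comp[i].w2) */
} plane_width;

/* Number of MCUs needed to cover img_dim samples when the largest
   sampling factor is max_factor (an MCU spans max_factor * 8 samples). */
static int
mcu_count(int img_dim, int max_factor)
{
	int mcu = max_factor * 8;
	assert(img_dim > 0 && max_factor >= 1 && max_factor <= 4);
	return (img_dim + mcu - 1) / mcu; /* round up */
}

/* Row geometry for a component with horizontal factor h. */
static plane_width
plane_geometry(int img_x, int h, int h_max)
{
	plane_width p;
	assert(h >= 1 && h <= h_max && h_max % h == 0);
	p.x = (img_x * h + h_max - 1) / h_max;   /* round up */
	p.w2 = mcu_count(img_x, h_max) * h * 8;  /* whole MCUs only */
	return p;
}
```

For a 33-pixel-wide image with `h_max == 2`, three 16-pixel MCUs are needed. The luma plane (`h == 2`) has 33 effective samples but a 48-sample row, and the chroma plane (`h == 1`) has 17 effective samples in a 24-sample row. The extra columns are the "bogus oversized data" that is discarded later.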
4170-
4171-// use comparisons since in some cases we handle more than one case (e.g. SOF)
4172-#define stbi__DNL(x) ((x) == 0xdc)
4173-#define stbi__SOI(x) ((x) == 0xd8)
4174-#define stbi__EOI(x) ((x) == 0xd9)
4175-#define stbi__SOF(x) ((x) == 0xc0 || (x) == 0xc1 || (x) == 0xc2)
4176-#define stbi__SOS(x) ((x) == 0xda)
4177-
4178-#define stbi__SOF_progressive(x) ((x) == 0xc2)
4179-
4180-static int
4181-stbi__decode_jpeg_header(stbi__jpeg *z, int scan)
4182-{
4183-	int m;
4184-	z->jfif = 0;
4185-	z->app14_color_transform = -1; // valid values are 0,1,2
4186-	z->marker = STBI__MARKER_none; // initialize cached marker to empty
4187-	m = stbi__get_marker(z);
4188-	if (!stbi__SOI(m)) {
4189-		return stbi__err("no SOI", "Corrupt JPEG");
4190-	}
4191-	if (scan == STBI__SCAN_type) {
4192-		return 1;
4193-	}
4194-	m = stbi__get_marker(z);
4195-	while (!stbi__SOF(m)) {
4196-		if (!stbi__process_marker(z, m)) {
4197-			return 0;
4198-		}
4199-		m = stbi__get_marker(z);
4200-		while (m == STBI__MARKER_none) {
4201-			// some files have extra padding after their blocks, so ok, we'll
4202-			// scan
4203-			if (stbi__at_eof(z->s)) {
4204-				return stbi__err("no SOF", "Corrupt JPEG");
4205-			}
4206-			m = stbi__get_marker(z);
4207-		}
4208-	}
4209-	z->progressive = stbi__SOF_progressive(m);
4210-	if (!stbi__process_frame_header(z, scan)) {
4211-		return 0;
4212-	}
4213-	return 1;
4214-}
4215-
4216-static stbi_uc
4217-stbi__skip_jpeg_junk_at_end(stbi__jpeg *j)
4218-{
4219-	// some JPEGs have junk at end, skip over it but if we find what looks
4220-	// like a valid marker, resume there
4221-	while (!stbi__at_eof(j->s)) {
4222-		stbi_uc x = stbi__get8(j->s);
4223-		while (x == 0xff) { // might be a marker
4224-			if (stbi__at_eof(j->s)) {
4225-				return STBI__MARKER_none;
4226-			}
4227-			x = stbi__get8(j->s);
4228-			if (x != 0x00 && x != 0xff) {
4229-				// not a stuffed zero or lead-in to another marker, looks
4230-				// like an actual marker, return it
4231-				return x;
4232-			}
4233-			// stuffed zero has x=0 now which ends the loop, meaning we go
4234-			// back to regular scan loop.
4235-			// repeated 0xff keeps trying to read the next byte of the marker.
4236-		}
4237-	}
4238-	return STBI__MARKER_none;
4239-}
4240-
4241-// decode image to YCbCr format
4242-static int
4243-stbi__decode_jpeg_image(stbi__jpeg *j)
4244-{
4245-	int m;
4246-	for (m = 0; m < 4; m++) {
4247-		j->img_comp[m].raw_data = NULL;
4248-		j->img_comp[m].raw_coeff = NULL;
4249-	}
4250-	j->restart_interval = 0;
4251-	if (!stbi__decode_jpeg_header(j, STBI__SCAN_load)) {
4252-		return 0;
4253-	}
4254-	m = stbi__get_marker(j);
4255-	while (!stbi__EOI(m)) {
4256-		if (stbi__SOS(m)) {
4257-			if (!stbi__process_scan_header(j)) {
4258-				return 0;
4259-			}
4260-			if (!stbi__parse_entropy_coded_data(j)) {
4261-				return 0;
4262-			}
4263-			if (j->marker == STBI__MARKER_none) {
4264-				j->marker = stbi__skip_jpeg_junk_at_end(j);
4265-				// if we reach eof without hitting a marker, stbi__get_marker()
4266-				// below will fail and we'll eventually return 0
4267-			}
4268-			m = stbi__get_marker(j);
4269-			if (STBI__RESTART(m)) {
4270-				m = stbi__get_marker(j);
4271-			}
4272-		} else if (stbi__DNL(m)) {
4273-			int Ld = stbi__get16be(j->s);
4274-			stbi__uint32 NL = stbi__get16be(j->s);
4275-			if (Ld != 4) {
4276-				return stbi__err("bad DNL len", "Corrupt JPEG");
4277-			}
4278-			if (NL != j->s->img_y) {
4279-				return stbi__err("bad DNL height", "Corrupt JPEG");
4280-			}
4281-			m = stbi__get_marker(j);
4282-		} else {
4283-			if (!stbi__process_marker(j, m)) {
4284-				return 1;
4285-			}
4286-			m = stbi__get_marker(j);
4287-		}
4288-	}
4289-	if (j->progressive) {
4290-		stbi__jpeg_finish(j);
4291-	}
4292-	return 1;
4293-}
4294-
4295-// static jfif-centered resampling (across block boundaries)
4296-
4297-typedef stbi_uc *(*resample_row_func)(stbi_uc *out, stbi_uc *in0, stbi_uc *in1,
4298-                                      int w, int hs);
4299-
4300-#define stbi__div4(x) ((stbi_uc)((x) >> 2))
4301-
4302-static stbi_uc *
4303-resample_row_1(stbi_uc *out, stbi_uc *in_near, stbi_uc *in_far, int w, int hs)
4304-{
4305-	STBI_NOTUSED(out);
4306-	STBI_NOTUSED(in_far);
4307-	STBI_NOTUSED(w);
4308-	STBI_NOTUSED(hs);
4309-	return in_near;
4310-}
4311-
4312-static stbi_uc *
4313-stbi__resample_row_v_2(stbi_uc *out, stbi_uc *in_near, stbi_uc *in_far, int w,
4314-                       int hs)
4315-{
4316-	// need to generate two samples vertically for every one in input
4317-	int i;
4318-	STBI_NOTUSED(hs);
4319-	for (i = 0; i < w; ++i) {
4320-		out[i] = stbi__div4(3 * in_near[i] + in_far[i] + 2);
4321-	}
4322-	return out;
4323-}
4324-
4325-static stbi_uc *
4326-stbi__resample_row_h_2(stbi_uc *out, stbi_uc *in_near, stbi_uc *in_far, int w,
4327-                       int hs)
4328-{
4329-	// need to generate two samples horizontally for every one in input
4330-	int i;
4331-	stbi_uc *input = in_near;
4332-
4333-	if (w == 1) {
4334-		// if only one sample, can't do any interpolation
4335-		out[0] = out[1] = input[0];
4336-		return out;
4337-	}
4338-
4339-	out[0] = input[0];
4340-	out[1] = stbi__div4(input[0] * 3 + input[1] + 2);
4341-	for (i = 1; i < w - 1; ++i) {
4342-		int n = 3 * input[i] + 2;
4343-		out[i * 2 + 0] = stbi__div4(n + input[i - 1]);
4344-		out[i * 2 + 1] = stbi__div4(n + input[i + 1]);
4345-	}
4346-	out[i * 2 + 0] = stbi__div4(input[w - 2] * 3 + input[w - 1] + 2);
4347-	out[i * 2 + 1] = input[w - 1];
4348-
4349-	STBI_NOTUSED(in_far);
4350-	STBI_NOTUSED(hs);
4351-
4352-	return out;
4353-}
4354-
4355-#define stbi__div16(x) ((stbi_uc)((x) >> 4))
4356-
4357-static stbi_uc *
4358-stbi__resample_row_hv_2(stbi_uc *out, stbi_uc *in_near, stbi_uc *in_far, int w,
4359-                        int hs)
4360-{
4361-	// need to generate 2x2 samples for every one in input
4362-	int i, t0, t1;
4363-	if (w == 1) {
4364-		out[0] = out[1] = stbi__div4(3 * in_near[0] + in_far[0] + 2);
4365-		return out;
4366-	}
4367-
4368-	t1 = 3 * in_near[0] + in_far[0];
4369-	out[0] = stbi__div4(t1 + 2);
4370-	for (i = 1; i < w; ++i) {
4371-		t0 = t1;
4372-		t1 = 3 * in_near[i] + in_far[i];
4373-		out[i * 2 - 1] = stbi__div16(3 * t0 + t1 + 8);
4374-		out[i * 2] = stbi__div16(3 * t1 + t0 + 8);
4375-	}
4376-	out[w * 2 - 1] = stbi__div4(t1 + 2);
4377-
4378-	STBI_NOTUSED(hs);
4379-
4380-	return out;
4381-}
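To see what the 2x2 resampler produces, here is a standalone sketch of the same arithmetic without the stb types. `upsample_hv2` is a hypothetical name, and the function returns nothing rather than the output pointer; otherwise it is meant to mirror the scalar routine above. The vertical pass weights the near row 3:1 against the far row (scaled by 4), and the horizontal pass applies the same 3:1 weighting between neighbouring columns (total scale 16).

```c
#include <assert.h>

/* Hypothetical standalone copy of the 2x2 upsampling arithmetic.
   out must have room for 2 * w bytes. */
static void
upsample_hv2(unsigned char *out, const unsigned char *near_row,
             const unsigned char *far_row, int w)
{
	int i, t0, t1;
	assert(out && near_row && far_row && w >= 1);

	/* vertical pass for column 0, kept scaled by 4 */
	t1 = 3 * near_row[0] + far_row[0];
	if (w == 1) {
		/* a single column cannot be interpolated horizontally */
		out[0] = out[1] = (unsigned char)((t1 + 2) >> 2);
		return;
	}

	out[0] = (unsigned char)((t1 + 2) >> 2); /* left edge: vertical only */
	for (i = 1; i < w; ++i) {
		t0 = t1;
		t1 = 3 * near_row[i] + far_row[i];
		/* horizontal pass: 3 parts nearer column, 1 part farther column */
		out[i * 2 - 1] = (unsigned char)((3 * t0 + t1 + 8) >> 4);
		out[i * 2] = (unsigned char)((3 * t1 + t0 + 8) >> 4);
	}
	out[w * 2 - 1] = (unsigned char)((t1 + 2) >> 2); /* right edge */
}
```

With both input rows equal to `{100, 200}`, the four output samples step from 100 to 200 through two interpolated values, which shows the edge samples passing through and the interior samples being blended.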
4382-
4383-#if defined(STBI_SSE2) || defined(STBI_NEON)
4384-static stbi_uc *
4385-stbi__resample_row_hv_2_simd(stbi_uc *out, stbi_uc *in_near, stbi_uc *in_far,
4386-                             int w, int hs)
4387-{
4388-	// need to generate 2x2 samples for every one in input
4389-	int i = 0, t0, t1;
4390-
4391-	if (w == 1) {
4392-		out[0] = out[1] = stbi__div4(3 * in_near[0] + in_far[0] + 2);
4393-		return out;
4394-	}
4395-
4396-	t1 = 3 * in_near[0] + in_far[0];
4397-	// process groups of 8 pixels for as long as we can.
4398-	// note we can't handle the last pixel in a row in this loop
4399-	// because we need to handle the filter boundary conditions.
4400-	for (; i < ((w - 1) & ~7); i += 8) {
4401-#if defined(STBI_SSE2)
4402-		// load and perform the vertical filtering pass
4403-		// this uses 3*x + y = 4*x + (y - x)
4404-		__m128i zero = _mm_setzero_si128();
4405-		__m128i farb = _mm_loadl_epi64((__m128i *)(in_far + i));
4406-		__m128i nearb = _mm_loadl_epi64((__m128i *)(in_near + i));
4407-		__m128i farw = _mm_unpacklo_epi8(farb, zero);
4408-		__m128i nearw = _mm_unpacklo_epi8(nearb, zero);
4409-		__m128i diff = _mm_sub_epi16(farw, nearw);
4410-		__m128i nears = _mm_slli_epi16(nearw, 2);
4411-		__m128i curr = _mm_add_epi16(nears, diff); // current row
4412-
4413-		// horizontal filter works the same way, based on shifted versions
4414-		// of the current row. "prev" is the current row shifted right by
4415-		// 1 pixel; we need to insert the previous pixel value (from t1).
4416-		// "next" is the current row shifted left by 1 pixel, with the
4417-		// first pixel of the next block of 8 pixels added in.
4418-		__m128i prv0 = _mm_slli_si128(curr, 2);
4419-		__m128i nxt0 = _mm_srli_si128(curr, 2);
4420-		__m128i prev = _mm_insert_epi16(prv0, t1, 0);
4421-		__m128i next =
4422-		    _mm_insert_epi16(nxt0, 3 * in_near[i + 8] + in_far[i + 8], 7);
4423-
4424-		// horizontal filter, polyphase implementation since it's convenient:
4425-		// even pixels = 3*cur + prev = cur*4 + (prev - cur)
4426-		// odd  pixels = 3*cur + next = cur*4 + (next - cur)
4427-		// note the shared term.
4428-		__m128i bias = _mm_set1_epi16(8);
4429-		__m128i curs = _mm_slli_epi16(curr, 2);
4430-		__m128i prvd = _mm_sub_epi16(prev, curr);
4431-		__m128i nxtd = _mm_sub_epi16(next, curr);
4432-		__m128i curb = _mm_add_epi16(curs, bias);
4433-		__m128i even = _mm_add_epi16(prvd, curb);
4434-		__m128i odd = _mm_add_epi16(nxtd, curb);
4435-
4436-		// interleave even and odd pixels, then undo scaling.
4437-		__m128i int0 = _mm_unpacklo_epi16(even, odd);
4438-		__m128i int1 = _mm_unpackhi_epi16(even, odd);
4439-		__m128i de0 = _mm_srli_epi16(int0, 4);
4440-		__m128i de1 = _mm_srli_epi16(int1, 4);
4441-
4442-		// pack and write output
4443-		__m128i outv = _mm_packus_epi16(de0, de1);
4444-		_mm_storeu_si128((__m128i *)(out + i * 2), outv);
4445-#elif defined(STBI_NEON)
4446-		// load and perform the vertical filtering pass
4447-		// this uses 3*x + y = 4*x + (y - x)
4448-		uint8x8_t farb = vld1_u8(in_far + i);
4449-		uint8x8_t nearb = vld1_u8(in_near + i);
4450-		int16x8_t diff = vreinterpretq_s16_u16(vsubl_u8(farb, nearb));
4451-		int16x8_t nears = vreinterpretq_s16_u16(vshll_n_u8(nearb, 2));
4452-		int16x8_t curr = vaddq_s16(nears, diff); // current row
4453-
4454-		// horizontal filter works the same way, based on shifted versions
4455-		// of the current row. "prev" is the current row shifted right by
4456-		// 1 pixel; we need to insert the previous pixel value (from t1).
4457-		// "next" is the current row shifted left by 1 pixel, with the
4458-		// first pixel of the next block of 8 pixels added in.
4459-		int16x8_t prv0 = vextq_s16(curr, curr, 7);
4460-		int16x8_t nxt0 = vextq_s16(curr, curr, 1);
4461-		int16x8_t prev = vsetq_lane_s16(t1, prv0, 0);
4462-		int16x8_t next =
4463-		    vsetq_lane_s16(3 * in_near[i + 8] + in_far[i + 8], nxt0, 7);
4464-
4465-		// horizontal filter, polyphase implementation since it's convenient:
4466-		// even pixels = 3*cur + prev = cur*4 + (prev - cur)
4467-		// odd  pixels = 3*cur + next = cur*4 + (next - cur)
4468-		// note the shared term.
4469-		int16x8_t curs = vshlq_n_s16(curr, 2);
4470-		int16x8_t prvd = vsubq_s16(prev, curr);
4471-		int16x8_t nxtd = vsubq_s16(next, curr);
4472-		int16x8_t even = vaddq_s16(curs, prvd);
4473-		int16x8_t odd = vaddq_s16(curs, nxtd);
4474-
4475-		// undo scaling and round, then store with even/odd phases interleaved
4476-		uint8x8x2_t o;
4477-		o.val[0] = vqrshrun_n_s16(even, 4);
4478-		o.val[1] = vqrshrun_n_s16(odd, 4);
4479-		vst2_u8(out + i * 2, o);
4480-#endif
4481-
4482-		// "previous" value for next iter
4483-		t1 = 3 * in_near[i + 7] + in_far[i + 7];
4484-	}
4485-
4486-	t0 = t1;
4487-	t1 = 3 * in_near[i] + in_far[i];
4488-	out[i * 2] = stbi__div16(3 * t1 + t0 + 8);
4489-
4490-	for (++i; i < w; ++i) {
4491-		t0 = t1;
4492-		t1 = 3 * in_near[i] + in_far[i];
4493-		out[i * 2 - 1] = stbi__div16(3 * t0 + t1 + 8);
4494-		out[i * 2] = stbi__div16(3 * t1 + t0 + 8);
4495-	}
4496-	out[w * 2 - 1] = stbi__div4(t1 + 2);
4497-
4498-	STBI_NOTUSED(hs);
4499-
4500-	return out;
4501-}
4502-#endif
4503-
4504-static stbi_uc *
4505-stbi__resample_row_generic(stbi_uc *out, stbi_uc *in_near, stbi_uc *in_far,
4506-                           int w, int hs)
4507-{
4508-	// resample with nearest-neighbor
4509-	int i, j;
4510-	STBI_NOTUSED(in_far);
4511-	for (i = 0; i < w; ++i) {
4512-		for (j = 0; j < hs; ++j) {
4513-			out[i * hs + j] = in_near[i];
4514-		}
4515-	}
4516-	return out;
4517-}
4518-
4519-// this is a reduced-precision calculation of YCbCr-to-RGB introduced
4520-// to make sure the code produces the same results in both SIMD and scalar
4521-#define stbi__float2fixed(x) (((int)((x) * 4096.0f + 0.5f)) << 8)
4522-static void
4523-stbi__YCbCr_to_RGB_row(stbi_uc *out, const stbi_uc *y, const stbi_uc *pcb,
4524-                       const stbi_uc *pcr, int count, int step)
4525-{
4526-	int i;
4527-	for (i = 0; i < count; ++i) {
4528-		int y_fixed = (y[i] << 20) + (1 << 19); // rounding
4529-		int r, g, b;
4530-		int cr = pcr[i] - 128;
4531-		int cb = pcb[i] - 128;
4532-		r = y_fixed + cr * stbi__float2fixed(1.40200f);
4533-		g = y_fixed + (cr * -stbi__float2fixed(0.71414f)) +
4534-		    ((cb * -stbi__float2fixed(0.34414f)) & 0xffff0000);
4535-		b = y_fixed + cb * stbi__float2fixed(1.77200f);
4536-		r >>= 20;
4537-		g >>= 20;
4538-		b >>= 20;
4539-		if ((unsigned)r > 255) {
4540-			if (r < 0) {
4541-				r = 0;
4542-			} else {
4543-				r = 255;
4544-			}
4545-		}
4546-		if ((unsigned)g > 255) {
4547-			if (g < 0) {
4548-				g = 0;
4549-			} else {
4550-				g = 255;
4551-			}
4552-		}
4553-		if ((unsigned)b > 255) {
4554-			if (b < 0) {
4555-				b = 0;
4556-			} else {
4557-				b = 255;
4558-			}
4559-		}
4560-		out[0] = (stbi_uc)r;
4561-		out[1] = (stbi_uc)g;
4562-		out[2] = (stbi_uc)b;
4563-		out[3] = 255;
4564-		out += step;
4565-	}
4566-}
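The scalar kernel above works in fixed point with 20 fractional bits. The following single-pixel sketch shows the same arithmetic under simplifying assumptions: it omits the `& 0xffff0000` mask that the removed code applies to the Cb green term to match the SIMD path, and it clamps negatives before shifting to avoid right-shifting a negative `int`. The names `FLOAT2FIXED`, `from_fixed`, and `ycbcr_to_rgb` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Same constant scaling as the removed stbi__float2fixed macro. */
#define FLOAT2FIXED(x) (((int)((x) * 4096.0f + 0.5f)) << 8)

/* Drop the 20 fractional bits and clamp the result to 0..255. */
static uint8_t
from_fixed(int v)
{
	if (v < 0) {
		return 0;
	}
	v >>= 20;
	return (uint8_t)(v > 255 ? 255 : v);
}

/* Hypothetical single-pixel conversion; rgb receives R, G, B in that order. */
static void
ycbcr_to_rgb(uint8_t y, uint8_t cb, uint8_t cr, uint8_t rgb[3])
{
	int y_fixed = ((int)y << 20) + (1 << 19); /* add 0.5 for rounding */
	int cbc = (int)cb - 128;                  /* center chroma on zero */
	int crc = (int)cr - 128;
	assert(rgb != 0);
	rgb[0] = from_fixed(y_fixed + crc * FLOAT2FIXED(1.40200f));
	rgb[1] = from_fixed(y_fixed - crc * FLOAT2FIXED(0.71414f) -
	                    cbc * FLOAT2FIXED(0.34414f));
	rgb[2] = from_fixed(y_fixed + cbc * FLOAT2FIXED(1.77200f));
}
```

Neutral chroma (`cb == cr == 128`) leaves the luma value unchanged in all three channels, which is a quick sanity check for any replacement implementation.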
4567-
4568-#if defined(STBI_SSE2) || defined(STBI_NEON)
4569-static void
4570-stbi__YCbCr_to_RGB_simd(stbi_uc *out, stbi_uc const *y, stbi_uc const *pcb,
4571-                        stbi_uc const *pcr, int count, int step)
4572-{
4573-	int i = 0;
4574-
4575-#ifdef STBI_SSE2
4576-	// step == 3 is pretty ugly on the final interleave, and i'm not convinced
4577-	// it's useful in practice (you wouldn't use it for textures, for example).
4578-	// so just accelerate step == 4 case.
4579-	if (step == 4) {
4580-		// this is a fairly straightforward implementation and not
4581-		// super-optimized.
4582-		__m128i signflip = _mm_set1_epi8(-0x80);
4583-		__m128i cr_const0 = _mm_set1_epi16((short)(1.40200f * 4096.0f + 0.5f));
4584-		__m128i cr_const1 = _mm_set1_epi16(-(short)(0.71414f * 4096.0f + 0.5f));
4585-		__m128i cb_const0 = _mm_set1_epi16(-(short)(0.34414f * 4096.0f + 0.5f));
4586-		__m128i cb_const1 = _mm_set1_epi16((short)(1.77200f * 4096.0f + 0.5f));
4587-		__m128i y_bias = _mm_set1_epi8((char)(unsigned char)128);
4588-		__m128i xw = _mm_set1_epi16(255); // alpha channel
4589-
4590-		for (; i + 7 < count; i += 8) {
4591-			// load
4592-			__m128i y_bytes = _mm_loadl_epi64((__m128i *)(y + i));
4593-			__m128i cr_bytes = _mm_loadl_epi64((__m128i *)(pcr + i));
4594-			__m128i cb_bytes = _mm_loadl_epi64((__m128i *)(pcb + i));
4595-			__m128i cr_biased = _mm_xor_si128(cr_bytes, signflip); // -128
4596-			__m128i cb_biased = _mm_xor_si128(cb_bytes, signflip); // -128
4597-
4598-			// unpack to short (and left-shift cr, cb by 8)
4599-			__m128i yw = _mm_unpacklo_epi8(y_bias, y_bytes);
4600-			__m128i crw = _mm_unpacklo_epi8(_mm_setzero_si128(), cr_biased);
4601-			__m128i cbw = _mm_unpacklo_epi8(_mm_setzero_si128(), cb_biased);
4602-
4603-			// color transform
4604-			__m128i yws = _mm_srli_epi16(yw, 4);
4605-			__m128i cr0 = _mm_mulhi_epi16(cr_const0, crw);
4606-			__m128i cb0 = _mm_mulhi_epi16(cb_const0, cbw);
4607-			__m128i cb1 = _mm_mulhi_epi16(cbw, cb_const1);
4608-			__m128i cr1 = _mm_mulhi_epi16(crw, cr_const1);
4609-			__m128i rws = _mm_add_epi16(cr0, yws);
4610-			__m128i gwt = _mm_add_epi16(cb0, yws);
4611-			__m128i bws = _mm_add_epi16(yws, cb1);
4612-			__m128i gws = _mm_add_epi16(gwt, cr1);
4613-
4614-			// descale
4615-			__m128i rw = _mm_srai_epi16(rws, 4);
4616-			__m128i bw = _mm_srai_epi16(bws, 4);
4617-			__m128i gw = _mm_srai_epi16(gws, 4);
4618-
4619-			// back to byte, set up for transpose
4620-			__m128i brb = _mm_packus_epi16(rw, bw);
4621-			__m128i gxb = _mm_packus_epi16(gw, xw);
4622-
4623-			// transpose to interleave channels
4624-			__m128i t0 = _mm_unpacklo_epi8(brb, gxb);
4625-			__m128i t1 = _mm_unpackhi_epi8(brb, gxb);
4626-			__m128i o0 = _mm_unpacklo_epi16(t0, t1);
4627-			__m128i o1 = _mm_unpackhi_epi16(t0, t1);
4628-
4629-			// store
4630-			_mm_storeu_si128((__m128i *)(out + 0), o0);
4631-			_mm_storeu_si128((__m128i *)(out + 16), o1);
4632-			out += 32;
4633-		}
4634-	}
4635-#endif
4636-
4637-#ifdef STBI_NEON
4638-	// in this version, step=3 support would be easy to add. but is there
4639-	// demand?
4640-	if (step == 4) {
4641-		// this is a fairly straightforward implementation and not
4642-		// super-optimized.
4643-		uint8x8_t signflip = vdup_n_u8(0x80);
4644-		int16x8_t cr_const0 = vdupq_n_s16((short)(1.40200f * 4096.0f + 0.5f));
4645-		int16x8_t cr_const1 = vdupq_n_s16(-(short)(0.71414f * 4096.0f + 0.5f));
4646-		int16x8_t cb_const0 = vdupq_n_s16(-(short)(0.34414f * 4096.0f + 0.5f));
4647-		int16x8_t cb_const1 = vdupq_n_s16((short)(1.77200f * 4096.0f + 0.5f));
4648-
4649-		for (; i + 7 < count; i += 8) {
4650-			// load
4651-			uint8x8_t y_bytes = vld1_u8(y + i);
4652-			uint8x8_t cr_bytes = vld1_u8(pcr + i);
4653-			uint8x8_t cb_bytes = vld1_u8(pcb + i);
4654-			int8x8_t cr_biased =
4655-			    vreinterpret_s8_u8(vsub_u8(cr_bytes, signflip));
4656-			int8x8_t cb_biased =
4657-			    vreinterpret_s8_u8(vsub_u8(cb_bytes, signflip));
4658-
4659-			// expand to s16
4660-			int16x8_t yws = vreinterpretq_s16_u16(vshll_n_u8(y_bytes, 4));
4661-			int16x8_t crw = vshll_n_s8(cr_biased, 7);
4662-			int16x8_t cbw = vshll_n_s8(cb_biased, 7);
4663-
4664-			// color transform
4665-			int16x8_t cr0 = vqdmulhq_s16(crw, cr_const0);
4666-			int16x8_t cb0 = vqdmulhq_s16(cbw, cb_const0);
4667-			int16x8_t cr1 = vqdmulhq_s16(crw, cr_const1);
4668-			int16x8_t cb1 = vqdmulhq_s16(cbw, cb_const1);
4669-			int16x8_t rws = vaddq_s16(yws, cr0);
4670-			int16x8_t gws = vaddq_s16(vaddq_s16(yws, cb0), cr1);
4671-			int16x8_t bws = vaddq_s16(yws, cb1);
4672-
4673-			// undo scaling, round, convert to byte
4674-			uint8x8x4_t o;
4675-			o.val[0] = vqrshrun_n_s16(rws, 4);
4676-			o.val[1] = vqrshrun_n_s16(gws, 4);
4677-			o.val[2] = vqrshrun_n_s16(bws, 4);
4678-			o.val[3] = vdup_n_u8(255);
4679-
4680-			// store, interleaving r/g/b/a
4681-			vst4_u8(out, o);
4682-			out += 8 * 4;
4683-		}
4684-	}
4685-#endif
4686-
4687-	for (; i < count; ++i) {
4688-		int y_fixed = (y[i] << 20) + (1 << 19); // rounding
4689-		int r, g, b;
4690-		int cr = pcr[i] - 128;
4691-		int cb = pcb[i] - 128;
4692-		r = y_fixed + cr * stbi__float2fixed(1.40200f);
4693-		g = y_fixed + cr * -stbi__float2fixed(0.71414f) +
4694-		    ((cb * -stbi__float2fixed(0.34414f)) & 0xffff0000);
4695-		b = y_fixed + cb * stbi__float2fixed(1.77200f);
4696-		r >>= 20;
4697-		g >>= 20;
4698-		b >>= 20;
4699-		if ((unsigned)r > 255) {
4700-			if (r < 0) {
4701-				r = 0;
4702-			} else {
4703-				r = 255;
4704-			}
4705-		}
4706-		if ((unsigned)g > 255) {
4707-			if (g < 0) {
4708-				g = 0;
4709-			} else {
4710-				g = 255;
4711-			}
4712-		}
4713-		if ((unsigned)b > 255) {
4714-			if (b < 0) {
4715-				b = 0;
4716-			} else {
4717-				b = 255;
4718-			}
4719-		}
4720-		out[0] = (stbi_uc)r;
4721-		out[1] = (stbi_uc)g;
4722-		out[2] = (stbi_uc)b;
4723-		out[3] = 255;
4724-		out += step;
4725-	}
4726-}
4727-#endif
4728-
4729-// set up the kernels
4730-static void
4731-stbi__setup_jpeg(stbi__jpeg *j)
4732-{
4733-	j->idct_block_kernel = stbi__idct_block;
4734-	j->YCbCr_to_RGB_kernel = stbi__YCbCr_to_RGB_row;
4735-	j->resample_row_hv_2_kernel = stbi__resample_row_hv_2;
4736-
4737-#ifdef STBI_SSE2
4738-	if (stbi__sse2_available()) {
4739-		j->idct_block_kernel = stbi__idct_simd;
4740-		j->YCbCr_to_RGB_kernel = stbi__YCbCr_to_RGB_simd;
4741-		j->resample_row_hv_2_kernel = stbi__resample_row_hv_2_simd;
4742-	}
4743-#endif
4744-
4745-#ifdef STBI_NEON
4746-	j->idct_block_kernel = stbi__idct_simd;
4747-	j->YCbCr_to_RGB_kernel = stbi__YCbCr_to_RGB_simd;
4748-	j->resample_row_hv_2_kernel = stbi__resample_row_hv_2_simd;
4749-#endif
4750-}
4751-
4752-// clean up the temporary component buffers
4753-static void
4754-stbi__cleanup_jpeg(stbi__jpeg *j)
4755-{
4756-	stbi__free_jpeg_components(j, j->s->img_n, 0);
4757-}
4758-
4759-typedef struct {
4760-	resample_row_func resample;
4761-	stbi_uc *line0, *line1;
4762-	int hs, vs;  // expansion factor in each axis
4763-	int w_lores; // horizontal pixels pre-expansion
4764-	int ystep;   // how far through vertical expansion we are
4765-	int ypos;    // which pre-expansion row we're on
4766-} stbi__resample;
4767-
4768-// fast 0..255 * 0..255 => 0..255 rounded multiplication
4769-static stbi_uc
4770-stbi__blinn_8x8(stbi_uc x, stbi_uc y)
4771-{
4772-	unsigned int t = x * y + 128;
4773-	return (stbi_uc)((t + (t >> 8)) >> 8);
4774-}
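The helper above computes a rounded `x * y / 255` without a division. As a sketch for comparison, the same shift-based formula is shown next to a plain rounded division; `mul8_fast` and `mul8_reference` are hypothetical names, and agreement between the two is only spot-checked here, not proven.

```c
#include <assert.h> /* used by the spot checks in the note below */

/* Shift-based form, same arithmetic as the removed stbi__blinn_8x8. */
static unsigned char
mul8_fast(unsigned char x, unsigned char y)
{
	unsigned int t = (unsigned int)x * y + 128;
	return (unsigned char)((t + (t >> 8)) >> 8);
}

/* Reference form: round(x * y / 255) using an integer division. */
static unsigned char
mul8_reference(unsigned char x, unsigned char y)
{
	unsigned int p = (unsigned int)x * y;
	return (unsigned char)((2 * p + 255) / 510);
}
```

For example, `mul8_fast(128, 255)` and `mul8_reference(128, 255)` both give 128, and both give 39 for `(100, 100)`.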
4775-
4776-static stbi_uc *
4777-load_jpeg_image(stbi__jpeg *z, int *out_x, int *out_y, int *comp, int req_comp)
4778-{
4779-	int n, decode_n, is_rgb;
4780-	z->s->img_n = 0; // make stbi__cleanup_jpeg safe
4781-
4782-	// validate req_comp
4783-	if (req_comp < 0 || req_comp > 4) {
4784-		return stbi__errpuc("bad req_comp", "Internal error");
4785-	}
4786-
4787-	// load a jpeg image from whichever source, but leave in YCbCr format
4788-	if (!stbi__decode_jpeg_image(z)) {
4789-		stbi__cleanup_jpeg(z);
4790-		return NULL;
4791-	}
4792-
4793-	// determine actual number of components to generate
4794-	n = req_comp ? req_comp : z->s->img_n >= 3 ? 3 : 1;
4795-
4796-	is_rgb = z->s->img_n == 3 &&
4797-	         (z->rgb == 3 || (z->app14_color_transform == 0 && !z->jfif));
4798-
4799-	if (z->s->img_n == 3 && n < 3 && !is_rgb) {
4800-		decode_n = 1;
4801-	} else {
4802-		decode_n = z->s->img_n;
4803-	}
4804-
4805-	// nothing to do if no components requested; check this now to avoid
4806-	// accessing uninitialized coutput[0] later
4807-	if (decode_n <= 0) {
4808-		stbi__cleanup_jpeg(z);
4809-		return NULL;
4810-	}
4811-
4812-	// resample and color-convert
4813-	{
4814-		int k;
4815-		unsigned int i, j;
4816-		stbi_uc *output;
4817-		stbi_uc *coutput[4] = {NULL, NULL, NULL, NULL};
4818-
4819-		stbi__resample res_comp[4];
4820-
4821-		for (k = 0; k < decode_n; ++k) {
4822-			stbi__resample *r = &res_comp[k];
4823-
4824-			// allocate line buffer big enough for upsampling off the edges
4825-			// with upsample factor of 4
4826-			z->img_comp[k].linebuf = (stbi_uc *)stbi__malloc(z->s->img_x + 3);
4827-			if (!z->img_comp[k].linebuf) {
4828-				stbi__cleanup_jpeg(z);
4829-				return stbi__errpuc("outofmem", "Out of memory");
4830-			}
4831-
4832-			r->hs = z->img_h_max / z->img_comp[k].h;
4833-			r->vs = z->img_v_max / z->img_comp[k].v;
4834-			r->ystep = r->vs >> 1;
4835-			r->w_lores = (z->s->img_x + r->hs - 1) / r->hs;
4836-			r->ypos = 0;
4837-			r->line0 = r->line1 = z->img_comp[k].data;
4838-
4839-			if (r->hs == 1 && r->vs == 1) {
4840-				r->resample = resample_row_1;
4841-			} else if (r->hs == 1 && r->vs == 2) {
4842-				r->resample = stbi__resample_row_v_2;
4843-			} else if (r->hs == 2 && r->vs == 1) {
4844-				r->resample = stbi__resample_row_h_2;
4845-			} else if (r->hs == 2 && r->vs == 2) {
4846-				r->resample = z->resample_row_hv_2_kernel;
4847-			} else {
4848-				r->resample = stbi__resample_row_generic;
4849-			}
4850-		}
4851-
4852-		// can't error after this, so this is safe
4853-		output = (stbi_uc *)stbi__malloc_mad3(n, z->s->img_x, z->s->img_y, 1);
4854-		if (!output) {
4855-			stbi__cleanup_jpeg(z);
4856-			return stbi__errpuc("outofmem", "Out of memory");
4857-		}
4858-
4859-		// now go ahead and resample
4860-		for (j = 0; j < z->s->img_y; ++j) {
4861-			stbi_uc *out = output + n * z->s->img_x * j;
4862-			for (k = 0; k < decode_n; ++k) {
4863-				stbi__resample *r = &res_comp[k];
4864-				int y_bot = r->ystep >= (r->vs >> 1);
4865-				coutput[k] = r->resample(
4866-				    z->img_comp[k].linebuf, y_bot ? r->line1 : r->line0,
4867-				    y_bot ? r->line0 : r->line1, r->w_lores, r->hs);
4868-				if (++r->ystep >= r->vs) {
4869-					r->ystep = 0;
4870-					r->line0 = r->line1;
4871-					if (++r->ypos < z->img_comp[k].y) {
4872-						r->line1 += z->img_comp[k].w2;
4873-					}
4874-				}
4875-			}
4876-			if (n >= 3) {
4877-				stbi_uc *y = coutput[0];
4878-				if (z->s->img_n == 3) {
4879-					if (is_rgb) {
4880-						for (i = 0; i < z->s->img_x; ++i) {
4881-							out[0] = y[i];
4882-							out[1] = coutput[1][i];
4883-							out[2] = coutput[2][i];
4884-							out[3] = 255;
4885-							out += n;
4886-						}
4887-					} else {
4888-						z->YCbCr_to_RGB_kernel(out, y, coutput[1], coutput[2],
4889-						                       z->s->img_x, n);
4890-					}
4891-				} else if (z->s->img_n == 4) {
4892-					if (z->app14_color_transform == 0) { // CMYK
4893-						for (i = 0; i < z->s->img_x; ++i) {
4894-							stbi_uc m = coutput[3][i];
4895-							out[0] = stbi__blinn_8x8(coutput[0][i], m);
4896-							out[1] = stbi__blinn_8x8(coutput[1][i], m);
4897-							out[2] = stbi__blinn_8x8(coutput[2][i], m);
4898-							out[3] = 255;
4899-							out += n;
4900-						}
4901-					} else if (z->app14_color_transform == 2) { // YCCK
4902-						z->YCbCr_to_RGB_kernel(out, y, coutput[1], coutput[2],
4903-						                       z->s->img_x, n);
4904-						for (i = 0; i < z->s->img_x; ++i) {
4905-							stbi_uc m = coutput[3][i];
4906-							out[0] = stbi__blinn_8x8(255 - out[0], m);
4907-							out[1] = stbi__blinn_8x8(255 - out[1], m);
4908-							out[2] = stbi__blinn_8x8(255 - out[2], m);
4909-							out += n;
4910-						}
4911-					} else { // YCbCr + alpha?  Ignore the fourth channel for
4912-						     // now
4913-						z->YCbCr_to_RGB_kernel(out, y, coutput[1], coutput[2],
4914-						                       z->s->img_x, n);
4915-					}
4916-				} else {
4917-					for (i = 0; i < z->s->img_x; ++i) {
4918-						out[0] = out[1] = out[2] = y[i];
4919-						out[3] = 255; // not used if n==3
4920-						out += n;
4921-					}
4922-				}
4923-			} else {
4924-				if (is_rgb) {
4925-					if (n == 1) {
4926-						for (i = 0; i < z->s->img_x; ++i) {
4927-							*out++ = stbi__compute_y(
4928-							    coutput[0][i], coutput[1][i], coutput[2][i]);
4929-						}
4930-					} else {
4931-						for (i = 0; i < z->s->img_x; ++i, out += 2) {
4932-							out[0] = stbi__compute_y(
4933-							    coutput[0][i], coutput[1][i], coutput[2][i]);
4934-							out[1] = 255;
4935-						}
4936-					}
4937-				} else if (z->s->img_n == 4 && z->app14_color_transform == 0) {
4938-					for (i = 0; i < z->s->img_x; ++i) {
4939-						stbi_uc m = coutput[3][i];
4940-						stbi_uc r = stbi__blinn_8x8(coutput[0][i], m);
4941-						stbi_uc g = stbi__blinn_8x8(coutput[1][i], m);
4942-						stbi_uc b = stbi__blinn_8x8(coutput[2][i], m);
4943-						out[0] = stbi__compute_y(r, g, b);
4944-						out[1] = 255;
4945-						out += n;
4946-					}
4947-				} else if (z->s->img_n == 4 && z->app14_color_transform == 2) {
4948-					for (i = 0; i < z->s->img_x; ++i) {
4949-						out[0] =
4950-						    stbi__blinn_8x8(255 - coutput[0][i], coutput[3][i]);
4951-						out[1] = 255;
4952-						out += n;
4953-					}
4954-				} else {
4955-					stbi_uc *y = coutput[0];
4956-					if (n == 1) {
4957-						for (i = 0; i < z->s->img_x; ++i) {
4958-							out[i] = y[i];
4959-						}
4960-					} else {
4961-						for (i = 0; i < z->s->img_x; ++i) {
4962-							*out++ = y[i];
4963-							*out++ = 255;
4964-						}
4965-					}
4966-				}
4967-			}
4968-		}
4969-		stbi__cleanup_jpeg(z);
4970-		*out_x = z->s->img_x;
4971-		*out_y = z->s->img_y;
4972-		if (comp) {
4973-			*comp = z->s->img_n >= 3
4974-			            ? 3
4975-			            : 1; // report original components, not output
4976-		}
4977-		return output;
4978-	}
4979-}
4980-
4981-static void *
4982-stbi__jpeg_load(stbi__context *s, int *x, int *y, int *comp, int req_comp,
4983-                stbi__result_info *ri)
4984-{
4985-	unsigned char *result;
4986-	stbi__jpeg *j = (stbi__jpeg *)stbi__malloc(sizeof(stbi__jpeg));
4987-	if (!j) {
4988-		return stbi__errpuc("outofmem", "Out of memory");
4989-	}
4990-	memset(j, 0, sizeof(stbi__jpeg));
4991-	STBI_NOTUSED(ri);
4992-	j->s = s;
4993-	stbi__setup_jpeg(j);
4994-	result = load_jpeg_image(j, x, y, comp, req_comp);
4995-	STBI_FREE(j);
4996-	return result;
4997-}
4998-
4999-static int
5000-stbi__jpeg_test(stbi__context *s)
5001-{
5002-	int r;
5003-	stbi__jpeg *j = (stbi__jpeg *)stbi__malloc(sizeof(stbi__jpeg));
5004-	if (!j) {
5005-		return stbi__err("outofmem", "Out of memory");
5006-	}
5007-	memset(j, 0, sizeof(stbi__jpeg));
5008-	j->s = s;
5009-	stbi__setup_jpeg(j);
5010-	r = stbi__decode_jpeg_header(j, STBI__SCAN_type);
5011-	stbi__rewind(s);
5012-	STBI_FREE(j);
5013-	return r;
5014-}
5015-
5016-static int
5017-stbi__jpeg_info_raw(stbi__jpeg *j, int *x, int *y, int *comp)
5018-{
5019-	if (!stbi__decode_jpeg_header(j, STBI__SCAN_header)) {
5020-		stbi__rewind(j->s);
5021-		return 0;
5022-	}
5023-	if (x) {
5024-		*x = j->s->img_x;
5025-	}
5026-	if (y) {
5027-		*y = j->s->img_y;
5028-	}
5029-	if (comp) {
5030-		*comp = j->s->img_n >= 3 ? 3 : 1;
5031-	}
5032-	return 1;
5033-}
5034-
5035-static int
5036-stbi__jpeg_info(stbi__context *s, int *x, int *y, int *comp)
5037-{
5038-	int result;
5039-	stbi__jpeg *j = (stbi__jpeg *)(stbi__malloc(sizeof(stbi__jpeg)));
5040-	if (!j) {
5041-		return stbi__err("outofmem", "Out of memory");
5042-	}
5043-	memset(j, 0, sizeof(stbi__jpeg));
5044-	j->s = s;
5045-	result = stbi__jpeg_info_raw(j, x, y, comp);
5046-	STBI_FREE(j);
5047-	return result;
5048-}
5049-#endif
5050-
5051-// public domain zlib decode    v0.2  Sean Barrett 2006-11-18
5052-//    simple implementation
5053-//      - all input must be provided in an upfront buffer
5054-//      - all output is written to a single output buffer (can malloc/realloc)
5055-//    performance
5056-//      - fast huffman
5057-
5058-#ifndef STBI_NO_ZLIB
5059-
5060-// fast-way is faster to check than jpeg huffman, but slow way is slower
5061-#define STBI__ZFAST_BITS 9 // accelerate all cases in default tables
5062-#define STBI__ZFAST_MASK ((1 << STBI__ZFAST_BITS) - 1)
5063-#define STBI__ZNSYMS 288 // number of symbols in literal/length alphabet
5064-
5065-// zlib-style huffman encoding
5066-// (jpeg packs from left, zlib from right, so can't share code)
5067-typedef struct {
5068-	stbi__uint16 fast[1 << STBI__ZFAST_BITS];
5069-	stbi__uint16 firstcode[16];
5070-	int maxcode[17];
5071-	stbi__uint16 firstsymbol[16];
5072-	stbi_uc size[STBI__ZNSYMS];
5073-	stbi__uint16 value[STBI__ZNSYMS];
5074-} stbi__zhuffman;
5075-
5076-stbi_inline static int
5077-stbi__bitreverse16(int n)
5078-{
5079-	n = ((n & 0xAAAA) >> 1) | ((n & 0x5555) << 1);
5080-	n = ((n & 0xCCCC) >> 2) | ((n & 0x3333) << 2);
5081-	n = ((n & 0xF0F0) >> 4) | ((n & 0x0F0F) << 4);
5082-	n = ((n & 0xFF00) >> 8) | ((n & 0x00FF) << 8);
5083-	return n;
5084-}
5085-
5086-stbi_inline static int
5087-stbi__bit_reverse(int v, int bits)
5088-{
5089-	STBI_ASSERT(bits <= 16);
5090-	// to bit reverse n bits, reverse 16 and shift
5091-	// e.g. 11 bits, bit reverse and shift away 5
5092-	return stbi__bitreverse16(v) >> (16 - bits);
5093-}
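The two helpers above reverse the low `bits` bits of a code by reversing all 16 bits and shifting the unused ones away. A standalone sketch using unsigned arithmetic; `reverse16` and `reverse_bits` are hypothetical names:

```c
#include <assert.h>

/* Reverse all 16 bits by swapping progressively larger groups. */
static unsigned
reverse16(unsigned n)
{
	n = ((n & 0xAAAAu) >> 1) | ((n & 0x5555u) << 1); /* adjacent bits */
	n = ((n & 0xCCCCu) >> 2) | ((n & 0x3333u) << 2); /* bit pairs */
	n = ((n & 0xF0F0u) >> 4) | ((n & 0x0F0Fu) << 4); /* nibbles */
	n = ((n & 0xFF00u) >> 8) | ((n & 0x00FFu) << 8); /* bytes */
	return n;
}

/* Reverse only the low `bits` bits of v (1 <= bits <= 16). */
static unsigned
reverse_bits(unsigned v, int bits)
{
	assert(bits >= 1 && bits <= 16);
	return reverse16(v & 0xFFFFu) >> (16 - bits);
}
```

For instance, the 3-bit code `110` reverses to `011`, so `reverse_bits(6, 3)` is 3.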
5094-
5095-static int
5096-stbi__zbuild_huffman(stbi__zhuffman *z, const stbi_uc *sizelist, int num)
5097-{
5098-	int i, k = 0;
5099-	int code, next_code[16], sizes[17];
5100-
5101-	// DEFLATE spec for generating codes
5102-	memset(sizes, 0, sizeof(sizes));
5103-	memset(z->fast, 0, sizeof(z->fast));
5104-	for (i = 0; i < num; ++i) {
5105-		++sizes[sizelist[i]];
5106-	}
5107-	sizes[0] = 0;
5108-	for (i = 1; i < 16; ++i) {
5109-		if (sizes[i] > (1 << i)) {
5110-			return stbi__err("bad sizes", "Corrupt PNG");
5111-		}
5112-	}
5113-	code = 0;
5114-	for (i = 1; i < 16; ++i) {
5115-		next_code[i] = code;
5116-		z->firstcode[i] = (stbi__uint16)code;
5117-		z->firstsymbol[i] = (stbi__uint16)k;
5118-		code = (code + sizes[i]);
5119-		if (sizes[i]) {
5120-			if (code - 1 >= (1 << i)) {
5121-				return stbi__err("bad codelengths", "Corrupt PNG");
5122-			}
5123-		}
5124-		z->maxcode[i] = code << (16 - i); // preshift for inner loop
5125-		code <<= 1;
5126-		k += sizes[i];
5127-	}
5128-	z->maxcode[16] = 0x10000; // sentinel
5129-	for (i = 0; i < num; ++i) {
5130-		int s = sizelist[i];
5131-		if (s) {
5132-			int c = next_code[s] - z->firstcode[s] + z->firstsymbol[s];
5133-			stbi__uint16 fastv = (stbi__uint16)((s << 9) | i);
5134-			z->size[c] = (stbi_uc)s;
5135-			z->value[c] = (stbi__uint16)i;
5136-			if (s <= STBI__ZFAST_BITS) {
5137-				int j = stbi__bit_reverse(next_code[s], s);
5138-				while (j < (1 << STBI__ZFAST_BITS)) {
5139-					z->fast[j] = fastv;
5140-					j += (1 << s);
5141-				}
5142-			}
5143-			++next_code[s];
5144-		}
5145-	}
5146-	return 1;
5147-}
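The builder above follows the canonical code assignment described in the DEFLATE specification (RFC 1951, section 3.2.2): count the codes of each length, derive the first code of each length, then hand out codes in symbol order. The sketch below shows only that assignment step, without the fast lookup table or the validity checks of the removed function; `assign_canonical_codes` and `MAX_BITS` are hypothetical names.

```c
#include <assert.h>
#include <string.h>

#define MAX_BITS 15

/* Assign canonical Huffman codes from code lengths.
 * lengths[i] == 0 marks symbol i as unused; its code is set to 0. */
static void
assign_canonical_codes(const unsigned char *lengths, int num,
                       unsigned short *codes)
{
	int bl_count[MAX_BITS + 1];
	unsigned next_code[MAX_BITS + 1];
	unsigned code = 0;
	int i, bits;

	/* step 1: count how many symbols use each code length */
	memset(bl_count, 0, sizeof(bl_count));
	for (i = 0; i < num; ++i) {
		assert(lengths[i] <= MAX_BITS);
		++bl_count[lengths[i]];
	}
	bl_count[0] = 0;

	/* step 2: smallest code value for each length */
	next_code[0] = 0;
	for (bits = 1; bits <= MAX_BITS; ++bits) {
		code = (code + (unsigned)bl_count[bits - 1]) << 1;
		next_code[bits] = code;
	}

	/* step 3: assign consecutive codes within each length */
	for (i = 0; i < num; ++i) {
		int len = lengths[i];
		codes[i] = 0;
		if (len != 0) {
			codes[i] = (unsigned short)next_code[len]++;
		}
	}
}
```

With the RFC's example lengths `{3, 3, 3, 3, 3, 2, 4, 4}`, the first symbol receives code `010` and the sixth receives `00`.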
5148-
5149-// zlib-from-memory implementation for PNG reading
5150-//    because PNG allows splitting the zlib stream arbitrarily,
5151-//    and it's annoying structurally to have PNG call ZLIB call PNG,
5152-//    we require PNG read all the IDATs and combine them into a single
5153-//    memory buffer
5154-
5155-typedef struct {
5156-	stbi_uc *zbuffer, *zbuffer_end;
5157-	int num_bits;
5158-	int hit_zeof_once;
5159-	stbi__uint32 code_buffer;
5160-
5161-	char *zout;
5162-	char *zout_start;
5163-	char *zout_end;
5164-	int z_expandable;
5165-
5166-	stbi__zhuffman z_length, z_distance;
5167-} stbi__zbuf;
5168-
5169-stbi_inline static int
5170-stbi__zeof(stbi__zbuf *z)
5171-{
5172-	return (z->zbuffer >= z->zbuffer_end);
5173-}
5174-
5175-stbi_inline static stbi_uc
5176-stbi__zget8(stbi__zbuf *z)
5177-{
5178-	return stbi__zeof(z) ? 0 : *z->zbuffer++;
5179-}
5180-
5181-static void
5182-stbi__fill_bits(stbi__zbuf *z)
5183-{
5184-	do {
5185-		if (z->code_buffer >= (1U << z->num_bits)) {
5186-			z->zbuffer = z->zbuffer_end; /* treat this as EOF so we fail. */
5187-			return;
5188-		}
5189-		z->code_buffer |= (unsigned int)stbi__zget8(z) << z->num_bits;
5190-		z->num_bits += 8;
5191-	} while (z->num_bits <= 24);
5192-}
5193-
5194-stbi_inline static unsigned int
5195-stbi__zreceive(stbi__zbuf *z, int n)
5196-{
5197-	unsigned int k;
5198-	if (z->num_bits < n) {
5199-		stbi__fill_bits(z);
5200-	}
5201-	k = z->code_buffer & ((1 << n) - 1);
5202-	z->code_buffer >>= n;
5203-	z->num_bits -= n;
5204-	return k;
5205-}
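The bit buffer above is filled a byte at a time and consumed from the least significant end, which is how DEFLATE packs its fields. A reduced sketch of that reader, assuming reads of at most 16 bits and zero bits past the end of input; the `bit_reader` type and `read_bits` function are hypothetical and omit the corruption check of the removed `stbi__fill_bits`:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
	const uint8_t *p, *end; /* remaining input */
	uint32_t buf;           /* unread bits, least significant first */
	int nbits;              /* number of valid bits in buf */
} bit_reader;

/* Read n bits (1..16), LSB first. Reads past the end yield zero bits. */
static uint32_t
read_bits(bit_reader *br, int n)
{
	uint32_t v;
	assert(br != NULL && n >= 1 && n <= 16);
	while (br->nbits < n) {
		uint32_t byte = (br->p < br->end) ? *br->p++ : 0;
		br->buf |= byte << br->nbits; /* new byte goes above old bits */
		br->nbits += 8;
	}
	v = br->buf & ((1u << n) - 1);
	br->buf >>= n;
	br->nbits -= n;
	return v;
}
```

Reading 3 bits and then 5 bits from the byte `0xB5` (binary `10110101`) returns `101` followed by `10110`.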
5206-
5207-static int
5208-stbi__zhuffman_decode_slowpath(stbi__zbuf *a, stbi__zhuffman *z)
5209-{
5210-	int b, s, k;
5211-	// not resolved by fast table, so compute it the slow way
5212-	// use jpeg approach, which requires MSbits at top
5213-	k = stbi__bit_reverse(a->code_buffer, 16);
5214-	for (s = STBI__ZFAST_BITS + 1;; ++s) {
5215-		if (k < z->maxcode[s]) {
5216-			break;
5217-		}
5218-	}
5219-	if (s >= 16) {
5220-		return -1; // invalid code!
5221-	}
5222-	// code size is s, so:
5223-	b = (k >> (16 - s)) - z->firstcode[s] + z->firstsymbol[s];
5224-	if (b >= STBI__ZNSYMS) {
5225-		return -1; // some data was corrupt somewhere!
5226-	}
5227-	if (z->size[b] != s) {
5228-		return -1; // was originally an assert, but report failure instead.
5229-	}
5230-	a->code_buffer >>= s;
5231-	a->num_bits -= s;
5232-	return z->value[b];
5233-}
5234-
5235-stbi_inline static int
5236-stbi__zhuffman_decode(stbi__zbuf *a, stbi__zhuffman *z)
5237-{
5238-	int b, s;
5239-	if (a->num_bits < 16) {
5240-		if (stbi__zeof(a)) {
5241-			if (!a->hit_zeof_once) {
5242-				// This is the first time we hit eof, insert 16 extra padding
5243-				// bits to allow us to keep going; if we actually consume any of
5244-				// them though, that is invalid data. This is caught later.
5245-				a->hit_zeof_once = 1;
5246-				a->num_bits += 16; // add 16 implicit zero bits
5247-			} else {
5248-				// We already inserted our extra 16 padding bits and are again
5249-				// out, this stream is actually prematurely terminated.
5250-				return -1;
5251-			}
5252-		} else {
5253-			stbi__fill_bits(a);
5254-		}
5255-	}
5256-	b = z->fast[a->code_buffer & STBI__ZFAST_MASK];
5257-	if (b) {
5258-		s = b >> 9;
5259-		a->code_buffer >>= s;
5260-		a->num_bits -= s;
5261-		return b & 511;
5262-	}
5263-	return stbi__zhuffman_decode_slowpath(a, z);
5264-}
5265-
5266-static int
5267-stbi__zexpand(stbi__zbuf *z, char *zout, int n) // need to make room for n bytes
5268-{
5269-	char *q;
5270-	unsigned int cur, limit, old_limit;
5271-	z->zout = zout;
5272-	if (!z->z_expandable) {
5273-		return stbi__err("output buffer limit", "Corrupt PNG");
5274-	}
5275-	cur = (unsigned int)(z->zout - z->zout_start);
5276-	limit = old_limit = (unsigned)(z->zout_end - z->zout_start);
5277-	if (UINT_MAX - cur < (unsigned)n) {
5278-		return stbi__err("outofmem", "Out of memory");
5279-	}
5280-	while (cur + n > limit) {
5281-		if (limit > UINT_MAX / 2) {
5282-			return stbi__err("outofmem", "Out of memory");
5283-		}
5284-		limit *= 2;
5285-	}
5286-	q = (char *)STBI_REALLOC_SIZED(z->zout_start, old_limit, limit);
5287-	STBI_NOTUSED(old_limit);
5288-	if (q == NULL) {
5289-		return stbi__err("outofmem", "Out of memory");
5290-	}
5291-	z->zout_start = q;
5292-	z->zout = q + cur;
5293-	z->zout_end = q + limit;
5294-	return 1;
5295-}
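The growth policy above doubles the output capacity until the pending write fits, and refuses when the size would overflow `unsigned int`. The capacity calculation alone can be sketched as follows, assuming a nonzero starting capacity; `grow_capacity` is a hypothetical name and the reallocation itself is left out:

```c
#include <assert.h>
#include <limits.h>

/* Return the capacity needed to hold cur + n bytes, doubling from limit.
 * Returns 0 if the request cannot be represented in an unsigned int. */
static unsigned
grow_capacity(unsigned cur, unsigned n, unsigned limit)
{
	assert(limit > 0 && cur <= limit);
	if (UINT_MAX - cur < n) {
		return 0; /* cur + n would overflow */
	}
	while (cur + n > limit) {
		if (limit > UINT_MAX / 2) {
			return 0; /* doubling would overflow */
		}
		limit *= 2;
	}
	return limit;
}
```

A buffer holding 10 of 16 bytes that needs 10 more would grow to 32, while a request that already fits leaves the capacity unchanged.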
5296-
5297-static const int stbi__zlength_base[31] = {
5298-    3,  4,  5,  6,  7,  8,  9,  10,  11,  13,  15,  17,  19,  23, 27, 31,
5299-    35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227, 258, 0,  0};
5300-
5301-static const int stbi__zlength_extra[31] = {0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
5302-                                            1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4,
5303-                                            4, 4, 5, 5, 5, 5, 0, 0, 0};
5304-
5305-static const int stbi__zdist_base[32] = {
5306-    1,    2,    3,    4,    5,    7,     9,     13,    17,  25,   33,
5307-    49,   65,   97,   129,  193,  257,   385,   513,   769, 1025, 1537,
5308-    2049, 3073, 4097, 6145, 8193, 12289, 16385, 24577, 0,   0};
5309-
5310-static const int stbi__zdist_extra[32] = {0, 0, 0,  0,  1,  1,  2,  2,  3,  3,
5311-                                          4, 4, 5,  5,  6,  6,  7,  7,  8,  8,
5312-                                          9, 9, 10, 10, 11, 11, 12, 12, 13, 13};
5313-
5314-static int
5315-stbi__parse_huffman_block(stbi__zbuf *a)
5316-{
5317-	char *zout = a->zout;
5318-	for (;;) {
5319-		int z = stbi__zhuffman_decode(a, &a->z_length);
5320-		if (z < 256) {
5321-			if (z < 0) {
5322-				return stbi__err("bad huffman code",
5323-				                 "Corrupt PNG"); // error in huffman codes
5324-			}
5325-			if (zout >= a->zout_end) {
5326-				if (!stbi__zexpand(a, zout, 1)) {
5327-					return 0;
5328-				}
5329-				zout = a->zout;
5330-			}
5331-			*zout++ = (char)z;
5332-		} else {
5333-			stbi_uc *p;
5334-			int len, dist;
5335-			if (z == 256) {
5336-				a->zout = zout;
5337-				if (a->hit_zeof_once && a->num_bits < 16) {
5338-					// The first time we hit zeof, we inserted 16 extra zero
5339-					// bits into our bit buffer so the decoder can just do its
5340-					// speculative decoding. But if we actually consumed any of
5341-					// those bits (which is the case when num_bits < 16), the
5342-					// stream actually read past the end so it is malformed.
5343-					return stbi__err("unexpected end", "Corrupt PNG");
5344-				}
5345-				return 1;
5346-			}
5347-			if (z >= 286) {
5348-				return stbi__err(
5349-				    "bad huffman code",
5350-				    "Corrupt PNG"); // per DEFLATE, length codes 286 and 287
5351-				                    // must not appear in compressed data
5352-			}
5353-			z -= 257;
5354-			len = stbi__zlength_base[z];
5355-			if (stbi__zlength_extra[z]) {
5356-				len += stbi__zreceive(a, stbi__zlength_extra[z]);
5357-			}
5358-			z = stbi__zhuffman_decode(a, &a->z_distance);
5359-			if (z < 0 || z >= 30) {
5360-				return stbi__err(
5361-				    "bad huffman code",
5362-				    "Corrupt PNG"); // per DEFLATE, distance codes 30 and 31
5363-				                    // must not appear in compressed data
5364-			}
5365-			dist = stbi__zdist_base[z];
5366-			if (stbi__zdist_extra[z]) {
5367-				dist += stbi__zreceive(a, stbi__zdist_extra[z]);
5368-			}
5369-			if (zout - a->zout_start < dist) {
5370-				return stbi__err("bad dist", "Corrupt PNG");
5371-			}
5372-			if (len > a->zout_end - zout) {
5373-				if (!stbi__zexpand(a, zout, len)) {
5374-					return 0;
5375-				}
5376-				zout = a->zout;
5377-			}
5378-			p = (stbi_uc *)(zout - dist);
5379-			if (dist == 1) { // run of one byte; common in images.
5380-				stbi_uc v = *p;
5381-				if (len) {
5382-					do {
5383-						*zout++ = v;
5384-					} while (--len);
5385-				}
5386-			} else {
5387-				if (len) {
5388-					do {
5389-						*zout++ = *p++;
5390-					} while (--len);
5391-				}
5392-			}
5393-		}
5394-	}
5395-}
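The tail of the block parser above copies a back-reference one byte at a time, which is what allows the source and destination to overlap when `dist < len`. A small sketch of that copy on a plain buffer; `lz77_copy` is a hypothetical name and the caller is assumed to have reserved room for `len` more bytes:

```c
#include <assert.h>
#include <stddef.h>

/* Append len bytes to buf (current length pos), each copied from dist
 * bytes back. Overlap is intentional: bytes written earlier in this call
 * may be read again later. Returns the new length. */
static size_t
lz77_copy(unsigned char *buf, size_t pos, size_t dist, size_t len)
{
	assert(buf != NULL && dist >= 1 && dist <= pos);
	while (len--) {
		buf[pos] = buf[pos - dist];
		++pos;
	}
	return pos;
}
```

Starting from `"ab"`, a match with `dist = 2` and `len = 4` extends the buffer to `"ababab"`; with `dist = 1` the copy degenerates into a run of one byte, the case the removed code special-cases.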
5396-
5397-static int
5398-stbi__compute_huffman_codes(stbi__zbuf *a)
5399-{
5400-	static const stbi_uc length_dezigzag[19] = {
5401-	    16, 17, 18, 0, 8, 7, 9, 6, 10, 5, 11, 4, 12, 3, 13, 2, 14, 1, 15};
5402-	stbi__zhuffman z_codelength;
5403-	stbi_uc lencodes[286 + 32 + 137]; // padding for maximum single op
5404-	stbi_uc codelength_sizes[19];
5405-	int i, n;
5406-
5407-	int hlit = stbi__zreceive(a, 5) + 257;
5408-	int hdist = stbi__zreceive(a, 5) + 1;
5409-	int hclen = stbi__zreceive(a, 4) + 4;
5410-	int ntot = hlit + hdist;
5411-
5412-	memset(codelength_sizes, 0, sizeof(codelength_sizes));
5413-	for (i = 0; i < hclen; ++i) {
5414-		int s = stbi__zreceive(a, 3);
5415-		codelength_sizes[length_dezigzag[i]] = (stbi_uc)s;
5416-	}
5417-	if (!stbi__zbuild_huffman(&z_codelength, codelength_sizes, 19)) {
5418-		return 0;
5419-	}
5420-
5421-	n = 0;
5422-	while (n < ntot) {
5423-		int c = stbi__zhuffman_decode(a, &z_codelength);
5424-		if (c < 0 || c >= 19) {
5425-			return stbi__err("bad codelengths", "Corrupt PNG");
5426-		}
5427-		if (c < 16) {
5428-			lencodes[n++] = (stbi_uc)c;
5429-		} else {
5430-			stbi_uc fill = 0;
5431-			if (c == 16) {
5432-				c = stbi__zreceive(a, 2) + 3;
5433-				if (n == 0) {
5434-					return stbi__err("bad codelengths", "Corrupt PNG");
5435-				}
5436-				fill = lencodes[n - 1];
5437-			} else if (c == 17) {
5438-				c = stbi__zreceive(a, 3) + 3;
5439-			} else if (c == 18) {
5440-				c = stbi__zreceive(a, 7) + 11;
5441-			} else {
5442-				return stbi__err("bad codelengths", "Corrupt PNG");
5443-			}
5444-			if (ntot - n < c) {
5445-				return stbi__err("bad codelengths", "Corrupt PNG");
5446-			}
5447-			memset(lencodes + n, fill, c);
5448-			n += c;
5449-		}
5450-	}
5451-	if (n != ntot) {
5452-		return stbi__err("bad codelengths", "Corrupt PNG");
5453-	}
5454-	if (!stbi__zbuild_huffman(&a->z_length, lencodes, hlit)) {
5455-		return 0;
5456-	}
5457-	if (!stbi__zbuild_huffman(&a->z_distance, lencodes + hlit, hdist)) {
5458-		return 0;
5459-	}
5460-	return 1;
5461-}
5462-
5463-static int
5464-stbi__parse_uncompressed_block(stbi__zbuf *a)
5465-{
5466-	stbi_uc header[4];
5467-	int len, nlen, k;
5468-	if (a->num_bits & 7) {
5469-		stbi__zreceive(a, a->num_bits & 7); // discard
5470-	}
5471-	// drain the bit-packed data into header
5472-	k = 0;
5473-	while (a->num_bits > 0) {
5474-		header[k++] =
5475-		    (stbi_uc)(a->code_buffer & 255); // suppress MSVC run-time check
5476-		a->code_buffer >>= 8;
5477-		a->num_bits -= 8;
5478-	}
5479-	if (a->num_bits < 0) {
5480-		return stbi__err("zlib corrupt", "Corrupt PNG");
5481-	}
5482-	// now fill header the normal way
5483-	while (k < 4) {
5484-		header[k++] = stbi__zget8(a);
5485-	}
5486-	len = header[1] * 256 + header[0];
5487-	nlen = header[3] * 256 + header[2];
5488-	if (nlen != (len ^ 0xffff)) {
5489-		return stbi__err("zlib corrupt", "Corrupt PNG");
5490-	}
5491-	if (a->zbuffer + len > a->zbuffer_end) {
5492-		return stbi__err("read past buffer", "Corrupt PNG");
5493-	}
5494-	if (a->zout + len > a->zout_end) {
5495-		if (!stbi__zexpand(a, a->zout, len)) {
5496-			return 0;
5497-		}
5498-	}
5499-	memcpy(a->zout, a->zbuffer, len);
5500-	a->zbuffer += len;
5501-	a->zout += len;
5502-	return 1;
5503-}
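A stored block starts with a 16-bit little-endian length followed by its one's complement, and the parser above rejects the block when the two disagree. That check in isolation might look like the sketch below; `stored_block_len` is a hypothetical name:

```c
#include <assert.h>
#include <stddef.h>

/* Parse the 4-byte LEN/NLEN header of a stored DEFLATE block.
 * Returns the payload length, or -1 if NLEN is not the one's
 * complement of LEN. */
static int
stored_block_len(const unsigned char header[4])
{
	int len, nlen;
	assert(header != NULL);
	len = header[1] * 256 + header[0];  /* little-endian LEN */
	nlen = header[3] * 256 + header[2]; /* little-endian NLEN */
	return (nlen == (len ^ 0xffff)) ? len : -1;
}
```

The header bytes `05 00 FA FF` describe a 5-byte payload, since `0x0005 ^ 0xFFFF` is `0xFFFA`.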
5504-
5505-static int
5506-stbi__parse_zlib_header(stbi__zbuf *a)
5507-{
5508-	int cmf = stbi__zget8(a);
5509-	int cm = cmf & 15;
5510-	/* int cinfo = cmf >> 4; */
5511-	int flg = stbi__zget8(a);
5512-	if (stbi__zeof(a)) {
5513-		return stbi__err("bad zlib header", "Corrupt PNG"); // zlib spec
5514-	}
5515-	if ((cmf * 256 + flg) % 31 != 0) {
5516-		return stbi__err("bad zlib header", "Corrupt PNG"); // zlib spec
5517-	}
5518-	if (flg & 32) {
5519-		return stbi__err("no preset dict",
5520-		                 "Corrupt PNG"); // preset dictionary not allowed in png
5521-	}
5522-	if (cm != 8) {
5523-		return stbi__err("bad compression",
5524-		                 "Corrupt PNG"); // DEFLATE required for png
5525-	}
5526-	// window = 1 << (8 + cinfo)... but who cares, we fully buffer output
5527-	return 1;
5528-}
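
As a rough illustration of the three header checks above, the following sketch applies them to a two-byte header. `zlib_header_ok_for_png` is a made-up name and, unlike the original, it does not report which check failed.

```c
#include <assert.h>

/* Hypothetical helper: returns 1 when the two zlib header bytes pass the
 * same checks as the function above, 0 otherwise. */
static int zlib_header_ok_for_png(int cmf, int flg)
{
    assert(cmf >= 0 && cmf <= 255 && flg >= 0 && flg <= 255);
    if ((cmf * 256 + flg) % 31 != 0) {
        return 0; /* FCHECK: the 16-bit header must be a multiple of 31 */
    }
    if (flg & 32) {
        return 0; /* FDICT: preset dictionaries are not allowed in PNG */
    }
    if ((cmf & 15) != 8) {
        return 0; /* CM: only DEFLATE (method 8) is accepted */
    }
    return 1;
}
```

The common header `78 9c` passes: 0x789c is 30876, which is 31 * 996.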
5529-
5530-static const stbi_uc stbi__zdefault_length[STBI__ZNSYMS] = {
5531-    8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
5532-    8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
5533-    8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
5534-    8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
5535-    8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
5536-    8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
5537-    9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
5538-    9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
5539-    9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
5540-    9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
5541-    9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 7, 7, 7, 7, 7, 7, 7, 7,
5542-    7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8};
5543-static const stbi_uc stbi__zdefault_distance[32] = {
5544-    5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5545-    5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5};
5546-/*
5547-Init algorithm:
5548-{
5549-   int i;   // use <= to match clearly with spec
5550-   for (i=0; i <= 143; ++i)     stbi__zdefault_length[i]   = 8;
5551-   for (   ; i <= 255; ++i)     stbi__zdefault_length[i]   = 9;
5552-   for (   ; i <= 279; ++i)     stbi__zdefault_length[i]   = 7;
5553-   for (   ; i <= 287; ++i)     stbi__zdefault_length[i]   = 8;
5554-
5555-   for (i=0; i <=  31; ++i)     stbi__zdefault_distance[i] = 5;
5556-}
5557-*/
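
The init algorithm in the comment above can be written out as a runnable sketch. The names here are illustrative, and the 288-entry size is taken from the length of the literal/length table above. The Kraft-sum helper is an added sanity check, not something stb_image does.

```c
#include <assert.h>
#include <stddef.h>

enum { FIXED_LITLEN_SYMS = 288 }; /* assumed to match the table above */

/* Fill the fixed literal/length code lengths for DEFLATE block type 1. */
static void fill_fixed_litlen_lengths(unsigned char *lengths, size_t n)
{
    size_t i = 0;
    assert(lengths != NULL && n == (size_t)FIXED_LITLEN_SYMS);
    for (; i <= 143; ++i) lengths[i] = 8;
    for (; i <= 255; ++i) lengths[i] = 9;
    for (; i <= 279; ++i) lengths[i] = 7;
    for (; i <= 287; ++i) lengths[i] = 8;
}

/* Kraft sum scaled by 2^15; a complete prefix code sums to exactly 1 << 15. */
static unsigned long kraft_sum_scaled(const unsigned char *lengths, size_t n)
{
    unsigned long sum = 0;
    size_t i;
    for (i = 0; i < n; ++i) {
        sum += 1ul << (15 - lengths[i]);
    }
    return sum;
}
```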
5558-
5559-static int
5560-stbi__parse_zlib(stbi__zbuf *a, int parse_header)
5561-{
5562-	int final, type;
5563-	if (parse_header) {
5564-		if (!stbi__parse_zlib_header(a)) {
5565-			return 0;
5566-		}
5567-	}
5568-	a->num_bits = 0;
5569-	a->code_buffer = 0;
5570-	a->hit_zeof_once = 0;
5571-	do {
5572-		final = stbi__zreceive(a, 1);
5573-		type = stbi__zreceive(a, 2);
5574-		if (type == 0) {
5575-			if (!stbi__parse_uncompressed_block(a)) {
5576-				return 0;
5577-			}
5578-		} else if (type == 3) {
5579-			return 0; // block type 3 is reserved, so the stream is invalid
5580-		} else {
5581-			if (type == 1) {
5582-				// use fixed code lengths
5583-				if (!stbi__zbuild_huffman(&a->z_length, stbi__zdefault_length,
5584-				                          STBI__ZNSYMS)) {
5585-					return 0;
5586-				}
5587-				if (!stbi__zbuild_huffman(&a->z_distance,
5588-				                          stbi__zdefault_distance, 32)) {
5589-					return 0;
5590-				}
5591-			} else {
5592-				if (!stbi__compute_huffman_codes(a)) {
5593-					return 0;
5594-				}
5595-			}
5596-			if (!stbi__parse_huffman_block(a)) {
5597-				return 0;
5598-			}
5599-		}
5600-	} while (!final);
5601-	return 1;
5602-}
5603-
5604-static int
5605-stbi__do_zlib(stbi__zbuf *a, char *obuf, int olen, int exp, int parse_header)
5606-{
5607-	a->zout_start = obuf;
5608-	a->zout = obuf;
5609-	a->zout_end = obuf + olen;
5610-	a->z_expandable = exp;
5611-
5612-	return stbi__parse_zlib(a, parse_header);
5613-}
5614-
5615-STBIDEF char *
5616-stbi_zlib_decode_malloc_guesssize(const char *buffer, int len, int initial_size,
5617-                                  int *outlen)
5618-{
5619-	stbi__zbuf a;
5620-	char *p = (char *)stbi__malloc(initial_size);
5621-	if (p == NULL) {
5622-		return NULL;
5623-	}
5624-	a.zbuffer = (stbi_uc *)buffer;
5625-	a.zbuffer_end = (stbi_uc *)buffer + len;
5626-	if (stbi__do_zlib(&a, p, initial_size, 1, 1)) {
5627-		if (outlen) {
5628-			*outlen = (int)(a.zout - a.zout_start);
5629-		}
5630-		return a.zout_start;
5631-	} else {
5632-		STBI_FREE(a.zout_start);
5633-		return NULL;
5634-	}
5635-}
5636-
5637-STBIDEF char *
5638-stbi_zlib_decode_malloc(char const *buffer, int len, int *outlen)
5639-{
5640-	return stbi_zlib_decode_malloc_guesssize(buffer, len, 16384, outlen);
5641-}
5642-
5643-STBIDEF char *
5644-stbi_zlib_decode_malloc_guesssize_headerflag(const char *buffer, int len,
5645-                                             int initial_size, int *outlen,
5646-                                             int parse_header)
5647-{
5648-	stbi__zbuf a;
5649-	char *p = (char *)stbi__malloc(initial_size);
5650-	if (p == NULL) {
5651-		return NULL;
5652-	}
5653-	a.zbuffer = (stbi_uc *)buffer;
5654-	a.zbuffer_end = (stbi_uc *)buffer + len;
5655-	if (stbi__do_zlib(&a, p, initial_size, 1, parse_header)) {
5656-		if (outlen) {
5657-			*outlen = (int)(a.zout - a.zout_start);
5658-		}
5659-		return a.zout_start;
5660-	} else {
5661-		STBI_FREE(a.zout_start);
5662-		return NULL;
5663-	}
5664-}
5665-
5666-STBIDEF int
5667-stbi_zlib_decode_buffer(char *obuffer, int olen, char const *ibuffer, int ilen)
5668-{
5669-	stbi__zbuf a;
5670-	a.zbuffer = (stbi_uc *)ibuffer;
5671-	a.zbuffer_end = (stbi_uc *)ibuffer + ilen;
5672-	if (stbi__do_zlib(&a, obuffer, olen, 0, 1)) {
5673-		return (int)(a.zout - a.zout_start);
5674-	} else {
5675-		return -1;
5676-	}
5677-}
5678-
5679-STBIDEF char *
5680-stbi_zlib_decode_noheader_malloc(char const *buffer, int len, int *outlen)
5681-{
5682-	stbi__zbuf a;
5683-	char *p = (char *)stbi__malloc(16384);
5684-	if (p == NULL) {
5685-		return NULL;
5686-	}
5687-	a.zbuffer = (stbi_uc *)buffer;
5688-	a.zbuffer_end = (stbi_uc *)buffer + len;
5689-	if (stbi__do_zlib(&a, p, 16384, 1, 0)) {
5690-		if (outlen) {
5691-			*outlen = (int)(a.zout - a.zout_start);
5692-		}
5693-		return a.zout_start;
5694-	} else {
5695-		STBI_FREE(a.zout_start);
5696-		return NULL;
5697-	}
5698-}
5699-
5700-STBIDEF int
5701-stbi_zlib_decode_noheader_buffer(char *obuffer, int olen, const char *ibuffer,
5702-                                 int ilen)
5703-{
5704-	stbi__zbuf a;
5705-	a.zbuffer = (stbi_uc *)ibuffer;
5706-	a.zbuffer_end = (stbi_uc *)ibuffer + ilen;
5707-	if (stbi__do_zlib(&a, obuffer, olen, 0, 0)) {
5708-		return (int)(a.zout - a.zout_start);
5709-	} else {
5710-		return -1;
5711-	}
5712-}
5713-#endif
5714-
5715-// public domain "baseline" PNG decoder   v0.10  Sean Barrett 2006-11-18
5716-//    simple implementation
5717-//      - only 8-bit samples
5718-//      - no CRC checking
5719-//      - allocates lots of intermediate memory
5720-//        - avoids problem of streaming data between subsystems
5721-//        - avoids explicit window management
5722-//    performance
5723-//      - uses stb_zlib, a PD zlib implementation with fast huffman decoding
5724-
5725-#ifndef STBI_NO_PNG
5726-typedef struct {
5727-	stbi__uint32 length;
5728-	stbi__uint32 type;
5729-} stbi__pngchunk;
5730-
5731-static stbi__pngchunk
5732-stbi__get_chunk_header(stbi__context *s)
5733-{
5734-	stbi__pngchunk c;
5735-	c.length = stbi__get32be(s);
5736-	c.type = stbi__get32be(s);
5737-	return c;
5738-}
5739-
5740-static int
5741-stbi__check_png_header(stbi__context *s)
5742-{
5743-	static const stbi_uc png_sig[8] = {137, 80, 78, 71, 13, 10, 26, 10};
5744-	int i;
5745-	for (i = 0; i < 8; ++i) {
5746-		if (stbi__get8(s) != png_sig[i]) {
5747-			return stbi__err("bad png sig", "Not a PNG");
5748-		}
5749-	}
5750-	return 1;
5751-}
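
For a caller that already holds the file bytes in memory, the same signature test can be sketched without the `stbi__context` reader. `is_png_signature` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 when buf starts with the 8-byte PNG
 * signature, 0 otherwise (including when fewer than 8 bytes are given). */
static int is_png_signature(const unsigned char *buf, size_t len)
{
    static const unsigned char sig[8] = {137, 80, 78, 71, 13, 10, 26, 10};
    assert(buf != NULL);
    return len >= 8 && memcmp(buf, sig, 8) == 0;
}
```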
5752-
5753-typedef struct {
5754-	stbi__context *s;
5755-	stbi_uc *idata, *expanded, *out;
5756-	int depth;
5757-} stbi__png;
5758-
5759-enum {
5760-	STBI__F_none = 0,
5761-	STBI__F_sub = 1,
5762-	STBI__F_up = 2,
5763-	STBI__F_avg = 3,
5764-	STBI__F_paeth = 4,
5765-	// synthetic filter used for first scanline to avoid needing a dummy row of
5766-	// 0s
5767-	STBI__F_avg_first
5768-};
5769-
5770-static stbi_uc first_row_filter[5] = {
5771-    STBI__F_none, STBI__F_sub, STBI__F_none, STBI__F_avg_first,
5772-    STBI__F_sub // Paeth with b=c=0 turns out to be equivalent to sub
5773-};
5774-
5775-static int
5776-stbi__paeth(int a, int b, int c)
5777-{
5778-	// This formulation looks very different from the reference in the PNG spec,
5779-	// but is actually equivalent and has favorable data dependencies and admits
5780-	// straightforward generation of branch-free code, which helps performance
5781-	// significantly.
5782-	int thresh = c * 3 - (a + b);
5783-	int lo = a < b ? a : b;
5784-	int hi = a < b ? b : a;
5785-	int t0 = (hi <= thresh) ? lo : c;
5786-	int t1 = (thresh <= lo) ? hi : t0;
5787-	return t1;
5788-}
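
The equivalence claimed in the comment above can be checked by brute force. The sketch below places the predictor as written in the PNG specification next to a copy of the branch-free form and counts disagreements over all byte triples. The function names are illustrative.

```c
#include <assert.h>
#include <stdlib.h>

/* Paeth predictor as written in the PNG specification. */
static int paeth_reference(int a, int b, int c)
{
    int p = a + b - c;
    int pa = abs(p - a);
    int pb = abs(p - b);
    int pc = abs(p - c);
    assert(a >= 0 && b >= 0 && c >= 0);
    if (pa <= pb && pa <= pc) return a;
    if (pb <= pc) return b;
    return c;
}

/* Copy of the branch-friendly formulation shown above. */
static int paeth_branchless(int a, int b, int c)
{
    int thresh = c * 3 - (a + b);
    int lo = a < b ? a : b;
    int hi = a < b ? b : a;
    int t0 = (hi <= thresh) ? lo : c;
    return (thresh <= lo) ? hi : t0;
}

/* Count disagreements over every (a, b, c) in 0..255. */
static long count_paeth_mismatches(void)
{
    long bad = 0;
    int a, b, c;
    for (a = 0; a < 256; ++a)
        for (b = 0; b < 256; ++b)
            for (c = 0; c < 256; ++c)
                bad += paeth_reference(a, b, c) != paeth_branchless(a, b, c);
    return bad;
}
```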
5789-
5790-static const stbi_uc stbi__depth_scale_table[9] = {0, 0xff, 0x55, 0,   0x11,
5791-                                                   0, 0,    0,    0x01};
5792-
5793-// adds an extra all-255 alpha channel
5794-// dest == src is legal
5795-// img_n must be 1 or 3
5796-static void
5797-stbi__create_png_alpha_expand8(stbi_uc *dest, stbi_uc *src, stbi__uint32 x,
5798-                               int img_n)
5799-{
5800-	int i;
5801-	// must process data backwards since we allow dest==src
5802-	if (img_n == 1) {
5803-		for (i = x - 1; i >= 0; --i) {
5804-			dest[i * 2 + 1] = 255;
5805-			dest[i * 2 + 0] = src[i];
5806-		}
5807-	} else {
5808-		STBI_ASSERT(img_n == 3);
5809-		for (i = x - 1; i >= 0; --i) {
5810-			dest[i * 4 + 3] = 255;
5811-			dest[i * 4 + 2] = src[i * 3 + 2];
5812-			dest[i * 4 + 1] = src[i * 3 + 1];
5813-			dest[i * 4 + 0] = src[i * 3 + 0];
5814-		}
5815-	}
5816-}
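
A minimal sketch of why the loops above run backwards when `dest == src`: each output pixel is wider than its input pixel, so writing from the last pixel down never overwrites a source byte that is still unread. The helper name is hypothetical and covers only the one-channel case.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: expand x gray samples at the start of buf into x
 * gray+alpha pairs in place. buf must have room for 2 * x bytes. */
static void gray_to_gray_alpha_inplace(unsigned char *buf, size_t x)
{
    size_t i;
    assert(buf != NULL);
    for (i = x; i-- > 0;) {
        buf[i * 2 + 1] = 255;    /* opaque alpha */
        buf[i * 2 + 0] = buf[i]; /* source index i is never above i * 2 */
    }
}
```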
5817-
5818-// create the png data from post-deflated data
5819-static int
5820-stbi__create_png_image_raw(stbi__png *a, stbi_uc *raw, stbi__uint32 raw_len,
5821-                           int out_n, stbi__uint32 x, stbi__uint32 y, int depth,
5822-                           int color)
5823-{
5824-	int bytes = (depth == 16 ? 2 : 1);
5825-	stbi__context *s = a->s;
5826-	stbi__uint32 i, j, stride = x * out_n * bytes;
5827-	stbi__uint32 img_len, img_width_bytes;
5828-	stbi_uc *filter_buf;
5829-	int all_ok = 1;
5830-	int k;
5831-	int img_n = s->img_n; // copy it into a local for later
5832-
5833-	int output_bytes = out_n * bytes;
5834-	int filter_bytes = img_n * bytes;
5835-	int width = x;
5836-
5837-	STBI_ASSERT(out_n == s->img_n || out_n == s->img_n + 1);
5838-	a->out = (stbi_uc *)stbi__malloc_mad3(
5839-	    x, y, output_bytes, 0); // final argument: extra padding bytes (none)
5840-	if (!a->out) {
5841-		return stbi__err("outofmem", "Out of memory");
5842-	}
5843-
5844-	// note: error exits here don't need to clean up a->out individually,
5845-	// stbi__do_png always does on error.
5846-	if (!stbi__mad3sizes_valid(img_n, x, depth, 7)) {
5847-		return stbi__err("too large", "Corrupt PNG");
5848-	}
5849-	img_width_bytes = (((img_n * x * depth) + 7) >> 3);
5850-	if (!stbi__mad2sizes_valid(img_width_bytes, y, img_width_bytes)) {
5851-		return stbi__err("too large", "Corrupt PNG");
5852-	}
5853-	img_len = (img_width_bytes + 1) * y;
5854-
5855-	// we used to check for exact match between raw_len and img_len on
5856-	// non-interlaced PNGs, but issue #276 reported a PNG in the wild that had
5857-	// extra data at the end (all zeros), so just check for raw_len < img_len
5858-	// always.
5859-	if (raw_len < img_len) {
5860-		return stbi__err("not enough pixels", "Corrupt PNG");
5861-	}
5862-
5863-	// Allocate two scan lines worth of filter workspace buffer.
5864-	filter_buf = (stbi_uc *)stbi__malloc_mad2(img_width_bytes, 2, 0);
5865-	if (!filter_buf) {
5866-		return stbi__err("outofmem", "Out of memory");
5867-	}
5868-
5869-	// Filtering for low-bit-depth images
5870-	if (depth < 8) {
5871-		filter_bytes = 1;
5872-		width = img_width_bytes;
5873-	}
5874-
5875-	for (j = 0; j < y; ++j) {
5876-		// cur/prior filter buffers alternate
5877-		stbi_uc *cur = filter_buf + (j & 1) * img_width_bytes;
5878-		stbi_uc *prior = filter_buf + (~j & 1) * img_width_bytes;
5879-		stbi_uc *dest = a->out + stride * j;
5880-		int nk = width * filter_bytes;
5881-		int filter = *raw++;
5882-
5883-		// check filter type
5884-		if (filter > 4) {
5885-			all_ok = stbi__err("invalid filter", "Corrupt PNG");
5886-			break;
5887-		}
5888-
5889-		// if first row, use special filter that doesn't sample previous row
5890-		if (j == 0) {
5891-			filter = first_row_filter[filter];
5892-		}
5893-
5894-		// perform actual filtering
5895-		switch (filter) {
5896-		case STBI__F_none:
5897-			memcpy(cur, raw, nk);
5898-			break;
5899-		case STBI__F_sub:
5900-			memcpy(cur, raw, filter_bytes);
5901-			for (k = filter_bytes; k < nk; ++k) {
5902-				cur[k] = STBI__BYTECAST(raw[k] + cur[k - filter_bytes]);
5903-			}
5904-			break;
5905-		case STBI__F_up:
5906-			for (k = 0; k < nk; ++k) {
5907-				cur[k] = STBI__BYTECAST(raw[k] + prior[k]);
5908-			}
5909-			break;
5910-		case STBI__F_avg:
5911-			for (k = 0; k < filter_bytes; ++k) {
5912-				cur[k] = STBI__BYTECAST(raw[k] + (prior[k] >> 1));
5913-			}
5914-			for (k = filter_bytes; k < nk; ++k) {
5915-				cur[k] = STBI__BYTECAST(
5916-				    raw[k] + ((prior[k] + cur[k - filter_bytes]) >> 1));
5917-			}
5918-			break;
5919-		case STBI__F_paeth:
5920-			for (k = 0; k < filter_bytes; ++k) {
5921-				cur[k] = STBI__BYTECAST(
5922-				    raw[k] + prior[k]); // prior[k] == stbi__paeth(0,prior[k],0)
5923-			}
5924-			for (k = filter_bytes; k < nk; ++k) {
5925-				cur[k] = STBI__BYTECAST(
5926-				    raw[k] + stbi__paeth(cur[k - filter_bytes], prior[k],
5927-				                         prior[k - filter_bytes]));
5928-			}
5929-			break;
5930-		case STBI__F_avg_first:
5931-			memcpy(cur, raw, filter_bytes);
5932-			for (k = filter_bytes; k < nk; ++k) {
5933-				cur[k] = STBI__BYTECAST(raw[k] + (cur[k - filter_bytes] >> 1));
5934-			}
5935-			break;
5936-		}
5937-
5938-		raw += nk;
5939-
5940-		// expand decoded bits in cur to dest, also adding an extra alpha
5941-		// channel if desired
5942-		if (depth < 8) {
5943-			stbi_uc scale = (color == 0)
5944-			                    ? stbi__depth_scale_table[depth]
5945-			                    : 1; // scale grayscale values to 0..255 range
5946-			stbi_uc *in = cur;
5947-			stbi_uc *out = dest;
5948-			stbi_uc inb = 0;
5949-			stbi__uint32 nsmp = x * img_n;
5950-
5951-			// expand bits to bytes first
5952-			if (depth == 4) {
5953-				for (i = 0; i < nsmp; ++i) {
5954-					if ((i & 1) == 0) {
5955-						inb = *in++;
5956-					}
5957-					*out++ = scale * (inb >> 4);
5958-					inb <<= 4;
5959-				}
5960-			} else if (depth == 2) {
5961-				for (i = 0; i < nsmp; ++i) {
5962-					if ((i & 3) == 0) {
5963-						inb = *in++;
5964-					}
5965-					*out++ = scale * (inb >> 6);
5966-					inb <<= 2;
5967-				}
5968-			} else {
5969-				STBI_ASSERT(depth == 1);
5970-				for (i = 0; i < nsmp; ++i) {
5971-					if ((i & 7) == 0) {
5972-						inb = *in++;
5973-					}
5974-					*out++ = scale * (inb >> 7);
5975-					inb <<= 1;
5976-				}
5977-			}
5978-
5979-			// insert alpha=255 values if desired
5980-			if (img_n != out_n) {
5981-				stbi__create_png_alpha_expand8(dest, dest, x, img_n);
5982-			}
5983-		} else if (depth == 8) {
5984-			if (img_n == out_n) {
5985-				memcpy(dest, cur, x * img_n);
5986-			} else {
5987-				stbi__create_png_alpha_expand8(dest, cur, x, img_n);
5988-			}
5989-		} else if (depth == 16) {
5990-			// convert the image data from big-endian to platform-native
5991-			stbi__uint16 *dest16 = (stbi__uint16 *)dest;
5992-			stbi__uint32 nsmp = x * img_n;
5993-
5994-			if (img_n == out_n) {
5995-				for (i = 0; i < nsmp; ++i, ++dest16, cur += 2) {
5996-					*dest16 = (cur[0] << 8) | cur[1];
5997-				}
5998-			} else {
5999-				STBI_ASSERT(img_n + 1 == out_n);
6000-				if (img_n == 1) {
6001-					for (i = 0; i < x; ++i, dest16 += 2, cur += 2) {
6002-						dest16[0] = (cur[0] << 8) | cur[1];
6003-						dest16[1] = 0xffff;
6004-					}
6005-				} else {
6006-					STBI_ASSERT(img_n == 3);
6007-					for (i = 0; i < x; ++i, dest16 += 4, cur += 6) {
6008-						dest16[0] = (cur[0] << 8) | cur[1];
6009-						dest16[1] = (cur[2] << 8) | cur[3];
6010-						dest16[2] = (cur[4] << 8) | cur[5];
6011-						dest16[3] = 0xffff;
6012-					}
6013-				}
6014-			}
6015-		}
6016-	}
6017-
6018-	STBI_FREE(filter_buf);
6019-	if (!all_ok) {
6020-		return 0;
6021-	}
6022-
6023-	return 1;
6024-}
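
The low-bit-depth path above packs several samples into each byte, most significant bits first. The sketch below shows the row-width formula and the unpacking step in isolation for grayscale data. The names are illustrative, and the scale table mirrors `stbi__depth_scale_table`.

```c
#include <assert.h>
#include <stddef.h>

/* Scale factors that map 1/2/4-bit gray values onto 0..255. */
static const unsigned char depth_scale[9] = {0, 0xff, 0x55, 0, 0x11,
                                             0, 0,    0,    0x01};

/* Bytes per filtered row, excluding the leading filter-type byte. */
static size_t png_row_bytes(size_t channels, size_t width, size_t depth)
{
    return (channels * width * depth + 7) >> 3;
}

/* Unpack nsmp samples of `depth` bits each into one scaled byte per sample. */
static void unpack_gray_row(unsigned char *out, const unsigned char *in,
                            size_t nsmp, int depth)
{
    size_t i, per_byte;
    unsigned char inb = 0;
    assert(depth == 1 || depth == 2 || depth == 4);
    per_byte = (size_t)(8 / depth);
    for (i = 0; i < nsmp; ++i) {
        if (i % per_byte == 0) {
            inb = *in++; /* fetch the next packed byte */
        }
        *out++ = (unsigned char)(depth_scale[depth] * (inb >> (8 - depth)));
        inb = (unsigned char)(inb << depth);
    }
}
```

With 2-bit samples, the byte `0x1b` (binary `00 01 10 11`) unpacks to `0x00, 0x55, 0xaa, 0xff`.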
6025-
6026-static int
6027-stbi__create_png_image(stbi__png *a, stbi_uc *image_data,
6028-                       stbi__uint32 image_data_len, int out_n, int depth,
6029-                       int color, int interlaced)
6030-{
6031-	int bytes = (depth == 16 ? 2 : 1);
6032-	int out_bytes = out_n * bytes;
6033-	stbi_uc *final;
6034-	int p;
6035-	if (!interlaced) {
6036-		return stbi__create_png_image_raw(a, image_data, image_data_len, out_n,
6037-		                                  a->s->img_x, a->s->img_y, depth,
6038-		                                  color);
6039-	}
6040-
6041-	// de-interlacing
6042-	final =
6043-	    (stbi_uc *)stbi__malloc_mad3(a->s->img_x, a->s->img_y, out_bytes, 0);
6044-	if (!final) {
6045-		return stbi__err("outofmem", "Out of memory");
6046-	}
6047-	for (p = 0; p < 7; ++p) {
6048-		int xorig[] = {0, 4, 0, 2, 0, 1, 0};
6049-		int yorig[] = {0, 0, 4, 0, 2, 0, 1};
6050-		int xspc[] = {8, 8, 4, 4, 2, 2, 1};
6051-		int yspc[] = {8, 8, 8, 4, 4, 2, 2};
6052-		int i, j, x, y;
6053-		// pass1_x[4] = 0, pass1_x[5] = 1, pass1_x[12] = 1
6054-		x = (a->s->img_x - xorig[p] + xspc[p] - 1) / xspc[p];
6055-		y = (a->s->img_y - yorig[p] + yspc[p] - 1) / yspc[p];
6056-		if (x && y) {
6057-			stbi__uint32 img_len =
6058-			    ((((a->s->img_n * x * depth) + 7) >> 3) + 1) * y;
6059-			if (!stbi__create_png_image_raw(a, image_data, image_data_len,
6060-			                                out_n, x, y, depth, color)) {
6061-				STBI_FREE(final);
6062-				return 0;
6063-			}
6064-			for (j = 0; j < y; ++j) {
6065-				for (i = 0; i < x; ++i) {
6066-					int out_y = j * yspc[p] + yorig[p];
6067-					int out_x = i * xspc[p] + xorig[p];
6068-					memcpy(final + out_y * a->s->img_x * out_bytes +
6069-					           out_x * out_bytes,
6070-					       a->out + (j * x + i) * out_bytes, out_bytes);
6071-				}
6072-			}
6073-			STBI_FREE(a->out);
6074-			image_data += img_len;
6075-			image_data_len -= img_len;
6076-		}
6077-	}
6078-	a->out = final;
6079-
6080-	return 1;
6081-}
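
The per-pass dimensions computed in the de-interlacing loop above can be sketched on their own. The tables are copied from the function; the helper names are hypothetical. Since the seven Adam7 passes partition the image, the pass sizes should sum to the full pixel count.

```c
#include <assert.h>

static const int adam7_xorig[7] = {0, 4, 0, 2, 0, 1, 0};
static const int adam7_yorig[7] = {0, 0, 4, 0, 2, 0, 1};
static const int adam7_xspc[7] = {8, 8, 4, 4, 2, 2, 1};
static const int adam7_yspc[7] = {8, 8, 8, 4, 4, 2, 2};

/* Width and height of the sub-image carried by Adam7 pass p (0..6). */
static void adam7_pass_size(int w, int h, int p, int *pw, int *ph)
{
    assert(w > 0 && h > 0 && p >= 0 && p < 7);
    *pw = (w - adam7_xorig[p] + adam7_xspc[p] - 1) / adam7_xspc[p];
    *ph = (h - adam7_yorig[p] + adam7_yspc[p] - 1) / adam7_yspc[p];
}

/* Sum of pixels over all passes; empty passes contribute zero. */
static long adam7_total_pixels(int w, int h)
{
    long total = 0;
    int p, pw, ph;
    for (p = 0; p < 7; ++p) {
        adam7_pass_size(w, h, p, &pw, &ph);
        total += (long)pw * ph;
    }
    return total;
}
```

For an 8x8 image the passes are 1x1, 1x1, 2x1, 2x2, 4x2, 4x4 and 8x4.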
6082-
6083-static int
6084-stbi__compute_transparency(stbi__png *z, stbi_uc tc[3], int out_n)
6085-{
6086-	stbi__context *s = z->s;
6087-	stbi__uint32 i, pixel_count = s->img_x * s->img_y;
6088-	stbi_uc *p = z->out;
6089-
6090-	// compute color-based transparency, assuming we've
6091-	// already got 255 as the alpha value in the output
6092-	STBI_ASSERT(out_n == 2 || out_n == 4);
6093-
6094-	if (out_n == 2) {
6095-		for (i = 0; i < pixel_count; ++i) {
6096-			p[1] = (p[0] == tc[0] ? 0 : 255);
6097-			p += 2;
6098-		}
6099-	} else {
6100-		for (i = 0; i < pixel_count; ++i) {
6101-			if (p[0] == tc[0] && p[1] == tc[1] && p[2] == tc[2]) {
6102-				p[3] = 0;
6103-			}
6104-			p += 4;
6105-		}
6106-	}
6107-	return 1;
6108-}
6109-
6110-static int
6111-stbi__compute_transparency16(stbi__png *z, stbi__uint16 tc[3], int out_n)
6112-{
6113-	stbi__context *s = z->s;
6114-	stbi__uint32 i, pixel_count = s->img_x * s->img_y;
6115-	stbi__uint16 *p = (stbi__uint16 *)z->out;
6116-
6117-	// compute color-based transparency, assuming we've
6118-	// already got 65535 as the alpha value in the output
6119-	STBI_ASSERT(out_n == 2 || out_n == 4);
6120-
6121-	if (out_n == 2) {
6122-		for (i = 0; i < pixel_count; ++i) {
6123-			p[1] = (p[0] == tc[0] ? 0 : 65535);
6124-			p += 2;
6125-		}
6126-	} else {
6127-		for (i = 0; i < pixel_count; ++i) {
6128-			if (p[0] == tc[0] && p[1] == tc[1] && p[2] == tc[2]) {
6129-				p[3] = 0;
6130-			}
6131-			p += 4;
6132-		}
6133-	}
6134-	return 1;
6135-}
6136-
6137-static int
6138-stbi__expand_png_palette(stbi__png *a, stbi_uc *palette, int len, int pal_img_n)
6139-{
6140-	stbi__uint32 i, pixel_count = a->s->img_x * a->s->img_y;
6141-	stbi_uc *p, *temp_out, *orig = a->out;
6142-
6143-	p = (stbi_uc *)stbi__malloc_mad2(pixel_count, pal_img_n, 0);
6144-	if (p == NULL) {
6145-		return stbi__err("outofmem", "Out of memory");
6146-	}
6147-
6148-	// between here and free(out) below, exiting would leak
6149-	temp_out = p;
6150-
6151-	if (pal_img_n == 3) {
6152-		for (i = 0; i < pixel_count; ++i) {
6153-			int n = orig[i] * 4;
6154-			p[0] = palette[n];
6155-			p[1] = palette[n + 1];
6156-			p[2] = palette[n + 2];
6157-			p += 3;
6158-		}
6159-	} else {
6160-		for (i = 0; i < pixel_count; ++i) {
6161-			int n = orig[i] * 4;
6162-			p[0] = palette[n];
6163-			p[1] = palette[n + 1];
6164-			p[2] = palette[n + 2];
6165-			p[3] = palette[n + 3];
6166-			p += 4;
6167-		}
6168-	}
6169-	STBI_FREE(a->out);
6170-	a->out = temp_out;
6171-
6172-	STBI_NOTUSED(len);
6173-
6174-	return 1;
6175-}
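
A compact sketch of the palette lookup above, assuming the same layout: four bytes (RGBA) per palette entry, one index byte per pixel. `expand_palette` is a hypothetical name and, like the original, it does not range-check indices against the palette length.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: write out_n (3 or 4) bytes per pixel by looking up
 * each index in a palette stored as 4 bytes per entry. */
static void expand_palette(unsigned char *out, const unsigned char *indices,
                           size_t pixel_count, const unsigned char *palette,
                           int out_n)
{
    size_t i;
    int k;
    assert(out_n == 3 || out_n == 4);
    for (i = 0; i < pixel_count; ++i) {
        const unsigned char *entry = palette + (size_t)indices[i] * 4;
        for (k = 0; k < out_n; ++k) {
            *out++ = entry[k];
        }
    }
}
```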
6176-
6177-static int stbi__unpremultiply_on_load_global = 0;
6178-static int stbi__de_iphone_flag_global = 0;
6179-
6180-STBIDEF void
6181-stbi_set_unpremultiply_on_load(int flag_true_if_should_unpremultiply)
6182-{
6183-	stbi__unpremultiply_on_load_global = flag_true_if_should_unpremultiply;
6184-}
6185-
6186-STBIDEF void
6187-stbi_convert_iphone_png_to_rgb(int flag_true_if_should_convert)
6188-{
6189-	stbi__de_iphone_flag_global = flag_true_if_should_convert;
6190-}
6191-
6192-#ifndef STBI_THREAD_LOCAL
6193-#define stbi__unpremultiply_on_load stbi__unpremultiply_on_load_global
6194-#define stbi__de_iphone_flag stbi__de_iphone_flag_global
6195-#else
6196-static STBI_THREAD_LOCAL int stbi__unpremultiply_on_load_local,
6197-    stbi__unpremultiply_on_load_set;
6198-static STBI_THREAD_LOCAL int stbi__de_iphone_flag_local,
6199-    stbi__de_iphone_flag_set;
6200-
6201-STBIDEF void
6202-stbi_set_unpremultiply_on_load_thread(int flag_true_if_should_unpremultiply)
6203-{
6204-	stbi__unpremultiply_on_load_local = flag_true_if_should_unpremultiply;
6205-	stbi__unpremultiply_on_load_set = 1;
6206-}
6207-
6208-STBIDEF void
6209-stbi_convert_iphone_png_to_rgb_thread(int flag_true_if_should_convert)
6210-{
6211-	stbi__de_iphone_flag_local = flag_true_if_should_convert;
6212-	stbi__de_iphone_flag_set = 1;
6213-}
6214-
6215-#define stbi__unpremultiply_on_load                                            \
6216-	(stbi__unpremultiply_on_load_set ? stbi__unpremultiply_on_load_local       \
6217-	                                 : stbi__unpremultiply_on_load_global)
6218-#define stbi__de_iphone_flag                                                   \
6219-	(stbi__de_iphone_flag_set ? stbi__de_iphone_flag_local                     \
6220-	                          : stbi__de_iphone_flag_global)
6221-#endif // STBI_THREAD_LOCAL
6222-
6223-static void
6224-stbi__de_iphone(stbi__png *z)
6225-{
6226-	stbi__context *s = z->s;
6227-	stbi__uint32 i, pixel_count = s->img_x * s->img_y;
6228-	stbi_uc *p = z->out;
6229-
6230-	if (s->img_out_n == 3) { // convert bgr to rgb
6231-		for (i = 0; i < pixel_count; ++i) {
6232-			stbi_uc t = p[0];
6233-			p[0] = p[2];
6234-			p[2] = t;
6235-			p += 3;
6236-		}
6237-	} else {
6238-		STBI_ASSERT(s->img_out_n == 4);
6239-		if (stbi__unpremultiply_on_load) {
6240-			// convert bgr to rgb and unpremultiply
6241-			for (i = 0; i < pixel_count; ++i) {
6242-				stbi_uc a = p[3];
6243-				stbi_uc t = p[0];
6244-				if (a) {
6245-					stbi_uc half = a / 2;
6246-					p[0] = (p[2] * 255 + half) / a;
6247-					p[1] = (p[1] * 255 + half) / a;
6248-					p[2] = (t * 255 + half) / a;
6249-				} else {
6250-					p[0] = p[2];
6251-					p[2] = t;
6252-				}
6253-				p += 4;
6254-			}
6255-		} else {
6256-			// convert bgr to rgb
6257-			for (i = 0; i < pixel_count; ++i) {
6258-				stbi_uc t = p[0];
6259-				p[0] = p[2];
6260-				p[2] = t;
6261-				p += 4;
6262-			}
6263-		}
6264-	}
6265-}
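
The unpremultiply arithmetic above divides each channel by alpha with rounding to nearest. A one-channel sketch, using a hypothetical helper name and assuming the input is validly premultiplied (channel value no greater than alpha):

```c
#include <assert.h>

/* Hypothetical helper: undo alpha premultiplication for one channel.
 * Adding a / 2 before the division rounds to nearest instead of truncating. */
static unsigned char unpremultiply_channel(unsigned char c, unsigned char a)
{
    assert(a != 0); /* the loop above skips the division when alpha is 0 */
    return (unsigned char)((c * 255 + a / 2) / a);
}
```

A channel value of 64 at alpha 128 maps back to 128, since (64 * 255 + 64) / 128 is exactly 128.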
6266-
6267-#define STBI__PNG_TYPE(a, b, c, d)                                             \
6268-	(((unsigned)(a) << 24) + ((unsigned)(b) << 16) + ((unsigned)(c) << 8) +    \
6269-	 (unsigned)(d))
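
To illustrate how the packed chunk type is used, the sketch below builds the same big-endian value from a four-character name and tests bit 29, which is bit 5 of the first letter (set when that letter is lowercase, meaning the chunk is ancillary). Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack a 4-character chunk name the same way as STBI__PNG_TYPE. */
static uint32_t png_chunk_type(const char *name)
{
    assert(name != NULL && name[0] && name[1] && name[2] && name[3]);
    return ((uint32_t)(unsigned char)name[0] << 24) +
           ((uint32_t)(unsigned char)name[1] << 16) +
           ((uint32_t)(unsigned char)name[2] << 8) +
           (uint32_t)(unsigned char)name[3];
}

/* Bit 29 set means the chunk is ancillary and may be skipped if unknown. */
static int png_chunk_is_ancillary(uint32_t type)
{
    return (type & (UINT32_C(1) << 29)) != 0;
}
```

Note that `CgBI` starts with an uppercase letter, so it counts as critical. That is presumably why the parser handles it with an explicit case rather than leaving it to the unknown-chunk path.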
6270-
6271-static int
6272-stbi__parse_png_file(stbi__png *z, int scan, int req_comp)
6273-{
6274-	stbi_uc palette[1024], pal_img_n = 0;
6275-	stbi_uc has_trans = 0, tc[3] = {0};
6276-	stbi__uint16 tc16[3];
6277-	stbi__uint32 ioff = 0, idata_limit = 0, i, pal_len = 0;
6278-	int first = 1, k, interlace = 0, color = 0, is_iphone = 0;
6279-	stbi__context *s = z->s;
6280-
6281-	z->expanded = NULL;
6282-	z->idata = NULL;
6283-	z->out = NULL;
6284-
6285-	if (!stbi__check_png_header(s)) {
6286-		return 0;
6287-	}
6288-
6289-	if (scan == STBI__SCAN_type) {
6290-		return 1;
6291-	}
6292-
6293-	for (;;) {
6294-		stbi__pngchunk c = stbi__get_chunk_header(s);
6295-		switch (c.type) {
6296-		case STBI__PNG_TYPE('C', 'g', 'B', 'I'):
6297-			is_iphone = 1;
6298-			stbi__skip(s, c.length);
6299-			break;
6300-		case STBI__PNG_TYPE('I', 'H', 'D', 'R'): {
6301-			int comp, filter;
6302-			if (!first) {
6303-				return stbi__err("multiple IHDR", "Corrupt PNG");
6304-			}
6305-			first = 0;
6306-			if (c.length != 13) {
6307-				return stbi__err("bad IHDR len", "Corrupt PNG");
6308-			}
6309-			s->img_x = stbi__get32be(s);
6310-			s->img_y = stbi__get32be(s);
6311-			if (s->img_y > STBI_MAX_DIMENSIONS) {
6312-				return stbi__err("too large", "Very large image (corrupt?)");
6313-			}
6314-			if (s->img_x > STBI_MAX_DIMENSIONS) {
6315-				return stbi__err("too large", "Very large image (corrupt?)");
6316-			}
6317-			z->depth = stbi__get8(s);
6318-			if (z->depth != 1 && z->depth != 2 && z->depth != 4 &&
6319-			    z->depth != 8 && z->depth != 16) {
6320-				return stbi__err("1/2/4/8/16-bit only",
6321-				                 "PNG not supported: 1/2/4/8/16-bit only");
6322-			}
6323-			color = stbi__get8(s);
6324-			if (color > 6) {
6325-				return stbi__err("bad ctype", "Corrupt PNG");
6326-			}
6327-			if (color == 3 && z->depth == 16) {
6328-				return stbi__err("bad ctype", "Corrupt PNG");
6329-			}
6330-			if (color == 3) {
6331-				pal_img_n = 3;
6332-			} else if (color & 1) {
6333-				return stbi__err("bad ctype", "Corrupt PNG");
6334-			}
6335-			comp = stbi__get8(s);
6336-			if (comp) {
6337-				return stbi__err("bad comp method", "Corrupt PNG");
6338-			}
6339-			filter = stbi__get8(s);
6340-			if (filter) {
6341-				return stbi__err("bad filter method", "Corrupt PNG");
6342-			}
6343-			interlace = stbi__get8(s);
6344-			if (interlace > 1) {
6345-				return stbi__err("bad interlace method", "Corrupt PNG");
6346-			}
6347-			if (!s->img_x || !s->img_y) {
6348-				return stbi__err("0-pixel image", "Corrupt PNG");
6349-			}
6350-			if (!pal_img_n) {
6351-				s->img_n = (color & 2 ? 3 : 1) + (color & 4 ? 1 : 0);
6352-				if ((1 << 30) / s->img_x / s->img_n < s->img_y) {
6353-					return stbi__err("too large", "Image too large to decode");
6354-				}
6355-			} else {
6356-				// if paletted, then pal_img_n is the final component count, and
6357-				// img_n is the number of components to decompress/filter.
6358-				s->img_n = 1;
6359-				if ((1 << 30) / s->img_x / 4 < s->img_y) {
6360-					return stbi__err("too large", "Corrupt PNG");
6361-				}
6362-			}
6363-			// even with SCAN_header, have to scan to see if we have a tRNS
6364-			break;
6365-		}
6366-
6367-		case STBI__PNG_TYPE('P', 'L', 'T', 'E'): {
6368-			if (first) {
6369-				return stbi__err("first not IHDR", "Corrupt PNG");
6370-			}
6371-			if (c.length > 256 * 3) {
6372-				return stbi__err("invalid PLTE", "Corrupt PNG");
6373-			}
6374-			pal_len = c.length / 3;
6375-			if (pal_len * 3 != c.length) {
6376-				return stbi__err("invalid PLTE", "Corrupt PNG");
6377-			}
6378-			for (i = 0; i < pal_len; ++i) {
6379-				palette[i * 4 + 0] = stbi__get8(s);
6380-				palette[i * 4 + 1] = stbi__get8(s);
6381-				palette[i * 4 + 2] = stbi__get8(s);
6382-				palette[i * 4 + 3] = 255;
6383-			}
6384-			break;
6385-		}
6386-
6387-		case STBI__PNG_TYPE('t', 'R', 'N', 'S'): {
6388-			if (first) {
6389-				return stbi__err("first not IHDR", "Corrupt PNG");
6390-			}
6391-			if (z->idata) {
6392-				return stbi__err("tRNS after IDAT", "Corrupt PNG");
6393-			}
6394-			if (pal_img_n) {
6395-				if (scan == STBI__SCAN_header) {
6396-					s->img_n = 4;
6397-					return 1;
6398-				}
6399-				if (pal_len == 0) {
6400-					return stbi__err("tRNS before PLTE", "Corrupt PNG");
6401-				}
6402-				if (c.length > pal_len) {
6403-					return stbi__err("bad tRNS len", "Corrupt PNG");
6404-				}
6405-				pal_img_n = 4;
6406-				for (i = 0; i < c.length; ++i) {
6407-					palette[i * 4 + 3] = stbi__get8(s);
6408-				}
6409-			} else {
6410-				if (!(s->img_n & 1)) {
6411-					return stbi__err("tRNS with alpha", "Corrupt PNG");
6412-				}
6413-				if (c.length != (stbi__uint32)s->img_n * 2) {
6414-					return stbi__err("bad tRNS len", "Corrupt PNG");
6415-				}
6416-				has_trans = 1;
6417-				// non-paletted with tRNS = constant alpha. if header-scanning,
6418-				// we can stop now.
6419-				if (scan == STBI__SCAN_header) {
6420-					++s->img_n;
6421-					return 1;
6422-				}
6423-				if (z->depth == 16) {
6424-					for (k = 0; k < s->img_n && k < 3;
6425-					     ++k) { // extra loop test to suppress false GCC warning
6426-						tc16[k] = (stbi__uint16)stbi__get16be(
6427-						    s); // copy the values as-is
6428-					}
6429-				} else {
6430-					for (k = 0; k < s->img_n && k < 3; ++k) {
6431-						tc[k] =
6432-						    (stbi_uc)(stbi__get16be(s) & 255) *
6433-						    stbi__depth_scale_table
6434-						        [z->depth]; // non 8-bit images will be larger
6435-					}
6436-				}
6437-			}
6438-			break;
6439-		}
6440-
6441-		case STBI__PNG_TYPE('I', 'D', 'A', 'T'): {
6442-			if (first) {
6443-				return stbi__err("first not IHDR", "Corrupt PNG");
6444-			}
6445-			if (pal_img_n && !pal_len) {
6446-				return stbi__err("no PLTE", "Corrupt PNG");
6447-			}
6448-			if (scan == STBI__SCAN_header) {
6449-				// header scan definitely stops at first IDAT
6450-				if (pal_img_n) {
6451-					s->img_n = pal_img_n;
6452-				}
6453-				return 1;
6454-			}
6455-			if (c.length > (1u << 30)) {
6456-				return stbi__err("IDAT size limit",
6457-				                 "IDAT section larger than 2^30 bytes");
6458-			}
6459-			if ((int)(ioff + c.length) < (int)ioff) {
6460-				return 0;
6461-			}
6462-			if (ioff + c.length > idata_limit) {
6463-				stbi__uint32 idata_limit_old = idata_limit;
6464-				stbi_uc *p;
6465-				if (idata_limit == 0) {
6466-					idata_limit = c.length > 4096 ? c.length : 4096;
6467-				}
6468-				while (ioff + c.length > idata_limit) {
6469-					idata_limit *= 2;
6470-				}
6471-				STBI_NOTUSED(idata_limit_old);
6472-				p = (stbi_uc *)STBI_REALLOC_SIZED(z->idata, idata_limit_old,
6473-				                                  idata_limit);
6474-				if (p == NULL) {
6475-					return stbi__err("outofmem", "Out of memory");
6476-				}
6477-				z->idata = p;
6478-			}
6479-			if (!stbi__getn(s, z->idata + ioff, c.length)) {
6480-				return stbi__err("outofdata", "Corrupt PNG");
6481-			}
6482-			ioff += c.length;
6483-			break;
6484-		}
6485-
6486-		case STBI__PNG_TYPE('I', 'E', 'N', 'D'): {
6487-			stbi__uint32 raw_len, bpl;
6488-			if (first) {
6489-				return stbi__err("first not IHDR", "Corrupt PNG");
6490-			}
6491-			if (scan != STBI__SCAN_load) {
6492-				return 1;
6493-			}
6494-			if (z->idata == NULL) {
6495-				return stbi__err("no IDAT", "Corrupt PNG");
6496-			}
6497-			// initial guess for decoded data size to avoid unnecessary reallocs
6498-			bpl =
6499-			    (s->img_x * z->depth + 7) / 8; // bytes per line, per component
6500-			raw_len = bpl * s->img_y * s->img_n /* pixels */ +
6501-			          s->img_y /* filter mode per row */;
6502-			z->expanded =
6503-			    (stbi_uc *)stbi_zlib_decode_malloc_guesssize_headerflag(
6504-			        (char *)z->idata, ioff, raw_len, (int *)&raw_len,
6505-			        !is_iphone);
6506-			if (z->expanded == NULL) {
6507-				return 0; // zlib should set error
6508-			}
6509-			STBI_FREE(z->idata);
6510-			z->idata = NULL;
6511-			if ((req_comp == s->img_n + 1 && req_comp != 3 && !pal_img_n) ||
6512-			    has_trans) {
6513-				s->img_out_n = s->img_n + 1;
6514-			} else {
6515-				s->img_out_n = s->img_n;
6516-			}
6517-			if (!stbi__create_png_image(z, z->expanded, raw_len, s->img_out_n,
6518-			                            z->depth, color, interlace)) {
6519-				return 0;
6520-			}
6521-			if (has_trans) {
6522-				if (z->depth == 16) {
6523-					if (!stbi__compute_transparency16(z, tc16, s->img_out_n)) {
6524-						return 0;
6525-					}
6526-				} else {
6527-					if (!stbi__compute_transparency(z, tc, s->img_out_n)) {
6528-						return 0;
6529-					}
6530-				}
6531-			}
6532-			if (is_iphone && stbi__de_iphone_flag && s->img_out_n > 2) {
6533-				stbi__de_iphone(z);
6534-			}
6535-			if (pal_img_n) {
6536-				// pal_img_n == 3 or 4
6537-				s->img_n = pal_img_n; // record the actual colors we had
6538-				s->img_out_n = pal_img_n;
6539-				if (req_comp >= 3) {
6540-					s->img_out_n = req_comp;
6541-				}
6542-				if (!stbi__expand_png_palette(z, palette, pal_len,
6543-				                              s->img_out_n)) {
6544-					return 0;
6545-				}
6546-			} else if (has_trans) {
6547-				// non-paletted image with tRNS -> source image has (constant)
6548-				// alpha
6549-				++s->img_n;
6550-			}
6551-			STBI_FREE(z->expanded);
6552-			z->expanded = NULL;
6553-			// end of PNG chunk, read and skip CRC
6554-			stbi__get32be(s);
6555-			return 1;
6556-		}
6557-
6558-		default:
6559-			// if critical, fail
6560-			if (first) {
6561-				return stbi__err("first not IHDR", "Corrupt PNG");
6562-			}
6563-			if ((c.type & (1 << 29)) == 0) {
6564-#ifndef STBI_NO_FAILURE_STRINGS
6565-				// not threadsafe
6566-				static char invalid_chunk[] = "XXXX PNG chunk not known";
6567-				invalid_chunk[0] = STBI__BYTECAST(c.type >> 24);
6568-				invalid_chunk[1] = STBI__BYTECAST(c.type >> 16);
6569-				invalid_chunk[2] = STBI__BYTECAST(c.type >> 8);
6570-				invalid_chunk[3] = STBI__BYTECAST(c.type >> 0);
6571-#endif
6572-				return stbi__err(invalid_chunk,
6573-				                 "PNG not supported: unknown PNG chunk type");
6574-			}
6575-			stbi__skip(s, c.length);
6576-			break;
6577-		}
6578-		// end of PNG chunk, read and skip CRC
6579-		stbi__get32be(s);
6580-	}
6581-}
6582-
6583-static void *
6584-stbi__do_png(stbi__png *p, int *x, int *y, int *n, int req_comp,
6585-             stbi__result_info *ri)
6586-{
6587-	void *result = NULL;
6588-	if (req_comp < 0 || req_comp > 4) {
6589-		return stbi__errpuc("bad req_comp", "Internal error");
6590-	}
6591-	if (stbi__parse_png_file(p, STBI__SCAN_load, req_comp)) {
6592-		if (p->depth <= 8) {
6593-			ri->bits_per_channel = 8;
6594-		} else if (p->depth == 16) {
6595-			ri->bits_per_channel = 16;
6596-		} else {
6597-			return stbi__errpuc("bad bits_per_channel",
6598-			                    "PNG not supported: unsupported color depth");
6599-		}
6600-		result = p->out;
6601-		p->out = NULL;
6602-		if (req_comp && req_comp != p->s->img_out_n) {
6603-			if (ri->bits_per_channel == 8) {
6604-				result = stbi__convert_format((unsigned char *)result,
6605-				                              p->s->img_out_n, req_comp,
6606-				                              p->s->img_x, p->s->img_y);
6607-			} else {
6608-				result = stbi__convert_format16((stbi__uint16 *)result,
6609-				                                p->s->img_out_n, req_comp,
6610-				                                p->s->img_x, p->s->img_y);
6611-			}
6612-			p->s->img_out_n = req_comp;
6613-			if (result == NULL) {
6614-				return result;
6615-			}
6616-		}
6617-		*x = p->s->img_x;
6618-		*y = p->s->img_y;
6619-		if (n) {
6620-			*n = p->s->img_n;
6621-		}
6622-	}
6623-	STBI_FREE(p->out);
6624-	p->out = NULL;
6625-	STBI_FREE(p->expanded);
6626-	p->expanded = NULL;
6627-	STBI_FREE(p->idata);
6628-	p->idata = NULL;
6629-
6630-	return result;
6631-}
6632-
6633-static void *
6634-stbi__png_load(stbi__context *s, int *x, int *y, int *comp, int req_comp,
6635-               stbi__result_info *ri)
6636-{
6637-	stbi__png p;
6638-	p.s = s;
6639-	return stbi__do_png(&p, x, y, comp, req_comp, ri);
6640-}
6641-
6642-static int
6643-stbi__png_test(stbi__context *s)
6644-{
6645-	int r;
6646-	r = stbi__check_png_header(s);
6647-	stbi__rewind(s);
6648-	return r;
6649-}
6650-
6651-static int
6652-stbi__png_info_raw(stbi__png *p, int *x, int *y, int *comp)
6653-{
6654-	if (!stbi__parse_png_file(p, STBI__SCAN_header, 0)) {
6655-		stbi__rewind(p->s);
6656-		return 0;
6657-	}
6658-	if (x) {
6659-		*x = p->s->img_x;
6660-	}
6661-	if (y) {
6662-		*y = p->s->img_y;
6663-	}
6664-	if (comp) {
6665-		*comp = p->s->img_n;
6666-	}
6667-	return 1;
6668-}
6669-
6670-static int
6671-stbi__png_info(stbi__context *s, int *x, int *y, int *comp)
6672-{
6673-	stbi__png p;
6674-	p.s = s;
6675-	return stbi__png_info_raw(&p, x, y, comp);
6676-}
6677-
6678-static int
6679-stbi__png_is16(stbi__context *s)
6680-{
6681-	stbi__png p;
6682-	p.s = s;
6683-	if (!stbi__png_info_raw(&p, NULL, NULL, NULL)) {
6684-		return 0;
6685-	}
6686-	if (p.depth != 16) {
6687-		stbi__rewind(p.s);
6688-		return 0;
6689-	}
6690-	return 1;
6691-}
6692-#endif
6693-
6694-// Microsoft/Windows BMP image
6695-
6696-#ifndef STBI_NO_BMP
6697-static int
6698-stbi__bmp_test_raw(stbi__context *s)
6699-{
6700-	int r;
6701-	int sz;
6702-	if (stbi__get8(s) != 'B') {
6703-		return 0;
6704-	}
6705-	if (stbi__get8(s) != 'M') {
6706-		return 0;
6707-	}
6708-	stbi__get32le(s); // discard filesize
6709-	stbi__get16le(s); // discard reserved
6710-	stbi__get16le(s); // discard reserved
6711-	stbi__get32le(s); // discard data offset
6712-	sz = stbi__get32le(s);
6713-	r = (sz == 12 || sz == 40 || sz == 56 || sz == 108 || sz == 124);
6714-	return r;
6715-}
6716-
6717-static int
6718-stbi__bmp_test(stbi__context *s)
6719-{
6720-	int r = stbi__bmp_test_raw(s);
6721-	stbi__rewind(s);
6722-	return r;
6723-}
6724-
6725-// returns 0..31 for the highest set bit, or -1 if z == 0
6726-static int
6727-stbi__high_bit(unsigned int z)
6728-{
6729-	int n = 0;
6730-	if (z == 0) {
6731-		return -1;
6732-	}
6733-	if (z >= 0x10000) {
6734-		n += 16;
6735-		z >>= 16;
6736-	}
6737-	if (z >= 0x00100) {
6738-		n += 8;
6739-		z >>= 8;
6740-	}
6741-	if (z >= 0x00010) {
6742-		n += 4;
6743-		z >>= 4;
6744-	}
6745-	if (z >= 0x00004) {
6746-		n += 2;
6747-		z >>= 2;
6748-	}
6749-	if (z >= 0x00002) {
6750-		n += 1; /* >>=  1;*/
6751-	}
6752-	return n;
6753-}
6754-
6755-static int
6756-stbi__bitcount(unsigned int a)
6757-{
6758-	a = (a & 0x55555555) + ((a >> 1) & 0x55555555); // max 2
6759-	a = (a & 0x33333333) + ((a >> 2) & 0x33333333); // max 4
6760-	a = (a + (a >> 4)) & 0x0f0f0f0f;                // max 8 per 4, now 8 bits
6761-	a = (a + (a >> 8));                             // max 16 per 8 bits
6762-	a = (a + (a >> 16));                            // max 32 per 8 bits
6763-	return a & 0xff;
6764-}
6765-
6766-// extract an arbitrarily-aligned N-bit value (N=bits)
6767-// from v, and then make it 8-bits long and fractionally
6768-// extend it to the full range.
6769-static int
6770-stbi__shiftsigned(unsigned int v, int shift, int bits)
6771-{
6772-	static unsigned int mul_table[9] = {
6773-	    0,
6774-	    0xff /*0b11111111*/,
6775-	    0x55 /*0b01010101*/,
6776-	    0x49 /*0b01001001*/,
6777-	    0x11 /*0b00010001*/,
6778-	    0x21 /*0b00100001*/,
6779-	    0x41 /*0b01000001*/,
6780-	    0x81 /*0b10000001*/,
6781-	    0x01 /*0b00000001*/,
6782-	};
6783-	static unsigned int shift_table[9] = {
6784-	    0, 0, 0, 1, 0, 2, 4, 6, 0,
6785-	};
6786-	if (shift < 0) {
6787-		v <<= -shift;
6788-	} else {
6789-		v >>= shift;
6790-	}
6791-	STBI_ASSERT(v < 256);
6792-	v >>= (8 - bits);
6793-	STBI_ASSERT(bits >= 0 && bits <= 8);
6794-	return (int)((unsigned)v * mul_table[bits]) >> shift_table[bits];
6795-}
6796-
6797-typedef struct {
6798-	int bpp, offset, hsz;
6799-	unsigned int mr, mg, mb, ma, all_a;
6800-	int extra_read;
6801-} stbi__bmp_data;
6802-
6803-static int
6804-stbi__bmp_set_mask_defaults(stbi__bmp_data *info, int compress)
6805-{
6806-	// BI_BITFIELDS specifies masks explicitly, don't override
6807-	if (compress == 3) {
6808-		return 1;
6809-	}
6810-
6811-	if (compress == 0) {
6812-		if (info->bpp == 16) {
6813-			info->mr = 31u << 10;
6814-			info->mg = 31u << 5;
6815-			info->mb = 31u << 0;
6816-		} else if (info->bpp == 32) {
6817-			info->mr = 0xffu << 16;
6818-			info->mg = 0xffu << 8;
6819-			info->mb = 0xffu << 0;
6820-			info->ma = 0xffu << 24;
6821-			info->all_a = 0; // if all_a is 0 at end, then we loaded alpha
6822-			                 // channel but it was all 0
6823-		} else {
6824-			// otherwise, use defaults, which is all-0
6825-			info->mr = info->mg = info->mb = info->ma = 0;
6826-		}
6827-		return 1;
6828-	}
6829-	return 0; // error
6830-}
6831-
6832-static void *
6833-stbi__bmp_parse_header(stbi__context *s, stbi__bmp_data *info)
6834-{
6835-	int hsz;
6836-	if (stbi__get8(s) != 'B' || stbi__get8(s) != 'M') {
6837-		return stbi__errpuc("not BMP", "Corrupt BMP");
6838-	}
6839-	stbi__get32le(s); // discard filesize
6840-	stbi__get16le(s); // discard reserved
6841-	stbi__get16le(s); // discard reserved
6842-	info->offset = stbi__get32le(s);
6843-	info->hsz = hsz = stbi__get32le(s);
6844-	info->mr = info->mg = info->mb = info->ma = 0;
6845-	info->extra_read = 14;
6846-
6847-	if (info->offset < 0) {
6848-		return stbi__errpuc("bad BMP", "bad BMP");
6849-	}
6850-
6851-	if (hsz != 12 && hsz != 40 && hsz != 56 && hsz != 108 && hsz != 124) {
6852-		return stbi__errpuc("unknown BMP", "BMP type not supported: unknown");
6853-	}
6854-	if (hsz == 12) {
6855-		s->img_x = stbi__get16le(s);
6856-		s->img_y = stbi__get16le(s);
6857-	} else {
6858-		s->img_x = stbi__get32le(s);
6859-		s->img_y = stbi__get32le(s);
6860-	}
6861-	if (stbi__get16le(s) != 1) {
6862-		return stbi__errpuc("bad BMP", "bad BMP");
6863-	}
6864-	info->bpp = stbi__get16le(s);
6865-	if (hsz != 12) {
6866-		int compress = stbi__get32le(s);
6867-		if (compress == 1 || compress == 2) {
6868-			return stbi__errpuc("BMP RLE", "BMP type not supported: RLE");
6869-		}
6870-		if (compress >= 4) {
6871-			return stbi__errpuc(
6872-			    "BMP JPEG/PNG",
6873-			    "BMP type not supported: unsupported compression"); // this
6874-			                                                        // includes
6875-			                                                        // PNG/JPEG
6876-			                                                        // modes
6877-		}
6878-		if (compress == 3 && info->bpp != 16 && info->bpp != 32) {
6879-			return stbi__errpuc(
6880-			    "bad BMP", "bad BMP"); // bitfields requires 16 or 32 bits/pixel
6881-		}
6882-		stbi__get32le(s); // discard sizeof
6883-		stbi__get32le(s); // discard hres
6884-		stbi__get32le(s); // discard vres
6885-		stbi__get32le(s); // discard colorsused
6886-		stbi__get32le(s); // discard max important
6887-		if (hsz == 40 || hsz == 56) {
6888-			if (hsz == 56) {
6889-				stbi__get32le(s);
6890-				stbi__get32le(s);
6891-				stbi__get32le(s);
6892-				stbi__get32le(s);
6893-			}
6894-			if (info->bpp == 16 || info->bpp == 32) {
6895-				if (compress == 0) {
6896-					stbi__bmp_set_mask_defaults(info, compress);
6897-				} else if (compress == 3) {
6898-					info->mr = stbi__get32le(s);
6899-					info->mg = stbi__get32le(s);
6900-					info->mb = stbi__get32le(s);
6901-					info->extra_read += 12;
6902-					// not documented, but generated by photoshop and handled by
6903-					// mspaint
6904-					if (info->mr == info->mg && info->mg == info->mb) {
6905-						// ?!?!?
6906-						return stbi__errpuc("bad BMP", "bad BMP");
6907-					}
6908-				} else {
6909-					return stbi__errpuc("bad BMP", "bad BMP");
6910-				}
6911-			}
6912-		} else {
6913-			// V4/V5 header
6914-			int i;
6915-			if (hsz != 108 && hsz != 124) {
6916-				return stbi__errpuc("bad BMP", "bad BMP");
6917-			}
6918-			info->mr = stbi__get32le(s);
6919-			info->mg = stbi__get32le(s);
6920-			info->mb = stbi__get32le(s);
6921-			info->ma = stbi__get32le(s);
6922-			if (compress != 3) { // override mr/mg/mb unless in BI_BITFIELDS
6923-				                 // mode, as per docs
6924-				stbi__bmp_set_mask_defaults(info, compress);
6925-			}
6926-			stbi__get32le(s); // discard color space
6927-			for (i = 0; i < 12; ++i) {
6928-				stbi__get32le(s); // discard color space parameters
6929-			}
6930-			if (hsz == 124) {
6931-				stbi__get32le(s); // discard rendering intent
6932-				stbi__get32le(s); // discard offset of profile data
6933-				stbi__get32le(s); // discard size of profile data
6934-				stbi__get32le(s); // discard reserved
6935-			}
6936-		}
6937-	}
6938-	return (void *)1;
6939-}
6940-
6941-static void *
6942-stbi__bmp_load(stbi__context *s, int *x, int *y, int *comp, int req_comp,
6943-               stbi__result_info *ri)
6944-{
6945-	stbi_uc *out;
6946-	unsigned int mr = 0, mg = 0, mb = 0, ma = 0, all_a;
6947-	stbi_uc pal[256][4];
6948-	int psize = 0, i, j, width;
6949-	int flip_vertically, pad, target;
6950-	stbi__bmp_data info;
6951-	STBI_NOTUSED(ri);
6952-
6953-	info.all_a = 255;
6954-	if (stbi__bmp_parse_header(s, &info) == NULL) {
6955-		return NULL; // error code already set
6956-	}
6957-
6958-	flip_vertically = ((int)s->img_y) > 0;
6959-	s->img_y = abs((int)s->img_y);
6960-
6961-	if (s->img_y > STBI_MAX_DIMENSIONS) {
6962-		return stbi__errpuc("too large", "Very large image (corrupt?)");
6963-	}
6964-	if (s->img_x > STBI_MAX_DIMENSIONS) {
6965-		return stbi__errpuc("too large", "Very large image (corrupt?)");
6966-	}
6967-
6968-	mr = info.mr;
6969-	mg = info.mg;
6970-	mb = info.mb;
6971-	ma = info.ma;
6972-	all_a = info.all_a;
6973-
6974-	if (info.hsz == 12) {
6975-		if (info.bpp < 24) {
6976-			psize = (info.offset - info.extra_read - 24) / 3;
6977-		}
6978-	} else {
6979-		if (info.bpp < 16) {
6980-			psize = (info.offset - info.extra_read - info.hsz) >> 2;
6981-		}
6982-	}
6983-	if (psize == 0) {
6984-		// accept some number of extra bytes after the header, but if the offset
6985-		// points either to before the header ends or implies a large amount of
6986-		// extra data, reject the file as malformed
6987-		int bytes_read_so_far = s->callback_already_read +
6988-		                        (int)(s->img_buffer - s->img_buffer_original);
6989-		int header_limit =
6990-		    1024; // max we actually read is below 256 bytes currently.
6991-		int extra_data_limit =
6992-		    256 * 4; // what ordinarily goes here is a palette; 256 entries*4
6993-		             // bytes is its max size.
6994-		if (bytes_read_so_far <= 0 || bytes_read_so_far > header_limit) {
6995-			return stbi__errpuc("bad header", "Corrupt BMP");
6996-		}
6997-		// we established that bytes_read_so_far is positive and sensible.
6998-		// the first half of this test rejects offsets that are either too small
6999-		// positives, or negative, and guarantees that info.offset >=
7000-		// bytes_read_so_far > 0. this in turn ensures the number computed in
7001-		// the second half of the test can't overflow.
7002-		if (info.offset < bytes_read_so_far ||
7003-		    info.offset - bytes_read_so_far > extra_data_limit) {
7004-			return stbi__errpuc("bad offset", "Corrupt BMP");
7005-		} else {
7006-			stbi__skip(s, info.offset - bytes_read_so_far);
7007-		}
7008-	}
7009-
7010-	if (info.bpp == 24 && ma == 0xff000000) {
7011-		s->img_n = 3;
7012-	} else {
7013-		s->img_n = ma ? 4 : 3;
7014-	}
7015-	if (req_comp && req_comp >= 3) { // we can directly decode 3 or 4
7016-		target = req_comp;
7017-	} else {
7018-		target = s->img_n; // if they want monochrome, we'll post-convert
7019-	}
7020-
7021-	// sanity-check size
7022-	if (!stbi__mad3sizes_valid(target, s->img_x, s->img_y, 0)) {
7023-		return stbi__errpuc("too large", "Corrupt BMP");
7024-	}
7025-
7026-	out = (stbi_uc *)stbi__malloc_mad3(target, s->img_x, s->img_y, 0);
7027-	if (!out) {
7028-		return stbi__errpuc("outofmem", "Out of memory");
7029-	}
7030-	if (info.bpp < 16) {
7031-		int z = 0;
7032-		if (psize == 0 || psize > 256) {
7033-			STBI_FREE(out);
7034-			return stbi__errpuc("invalid", "Corrupt BMP");
7035-		}
7036-		for (i = 0; i < psize; ++i) {
7037-			pal[i][2] = stbi__get8(s);
7038-			pal[i][1] = stbi__get8(s);
7039-			pal[i][0] = stbi__get8(s);
7040-			if (info.hsz != 12) {
7041-				stbi__get8(s);
7042-			}
7043-			pal[i][3] = 255;
7044-		}
7045-		stbi__skip(s, info.offset - info.extra_read - info.hsz -
7046-		                  psize * (info.hsz == 12 ? 3 : 4));
7047-		if (info.bpp == 1) {
7048-			width = (s->img_x + 7) >> 3;
7049-		} else if (info.bpp == 4) {
7050-			width = (s->img_x + 1) >> 1;
7051-		} else if (info.bpp == 8) {
7052-			width = s->img_x;
7053-		} else {
7054-			STBI_FREE(out);
7055-			return stbi__errpuc("bad bpp", "Corrupt BMP");
7056-		}
7057-		pad = (-width) & 3;
7058-		if (info.bpp == 1) {
7059-			for (j = 0; j < (int)s->img_y; ++j) {
7060-				int bit_offset = 7, v = stbi__get8(s);
7061-				for (i = 0; i < (int)s->img_x; ++i) {
7062-					int color = (v >> bit_offset) & 0x1;
7063-					out[z++] = pal[color][0];
7064-					out[z++] = pal[color][1];
7065-					out[z++] = pal[color][2];
7066-					if (target == 4) {
7067-						out[z++] = 255;
7068-					}
7069-					if (i + 1 == (int)s->img_x) {
7070-						break;
7071-					}
7072-					if ((--bit_offset) < 0) {
7073-						bit_offset = 7;
7074-						v = stbi__get8(s);
7075-					}
7076-				}
7077-				stbi__skip(s, pad);
7078-			}
7079-		} else {
7080-			for (j = 0; j < (int)s->img_y; ++j) {
7081-				for (i = 0; i < (int)s->img_x; i += 2) {
7082-					int v = stbi__get8(s), v2 = 0;
7083-					if (info.bpp == 4) {
7084-						v2 = v & 15;
7085-						v >>= 4;
7086-					}
7087-					out[z++] = pal[v][0];
7088-					out[z++] = pal[v][1];
7089-					out[z++] = pal[v][2];
7090-					if (target == 4) {
7091-						out[z++] = 255;
7092-					}
7093-					if (i + 1 == (int)s->img_x) {
7094-						break;
7095-					}
7096-					v = (info.bpp == 8) ? stbi__get8(s) : v2;
7097-					out[z++] = pal[v][0];
7098-					out[z++] = pal[v][1];
7099-					out[z++] = pal[v][2];
7100-					if (target == 4) {
7101-						out[z++] = 255;
7102-					}
7103-				}
7104-				stbi__skip(s, pad);
7105-			}
7106-		}
7107-	} else {
7108-		int rshift = 0, gshift = 0, bshift = 0, ashift = 0, rcount = 0,
7109-		    gcount = 0, bcount = 0, acount = 0;
7110-		int z = 0;
7111-		int easy = 0;
7112-		stbi__skip(s, info.offset - info.extra_read - info.hsz);
7113-		if (info.bpp == 24) {
7114-			width = 3 * s->img_x;
7115-		} else if (info.bpp == 16) {
7116-			width = 2 * s->img_x;
7117-		} else { /* bpp = 32 and pad = 0 */
7118-			width = 0;
7119-		}
7120-		pad = (-width) & 3;
7121-		if (info.bpp == 24) {
7122-			easy = 1;
7123-		} else if (info.bpp == 32) {
7124-			if (mb == 0xff && mg == 0xff00 && mr == 0x00ff0000 &&
7125-			    ma == 0xff000000) {
7126-				easy = 2;
7127-			}
7128-		}
7129-		if (!easy) {
7130-			if (!mr || !mg || !mb) {
7131-				STBI_FREE(out);
7132-				return stbi__errpuc("bad masks", "Corrupt BMP");
7133-			}
7134-			// right shift amt to put high bit in position #7
7135-			rshift = stbi__high_bit(mr) - 7;
7136-			rcount = stbi__bitcount(mr);
7137-			gshift = stbi__high_bit(mg) - 7;
7138-			gcount = stbi__bitcount(mg);
7139-			bshift = stbi__high_bit(mb) - 7;
7140-			bcount = stbi__bitcount(mb);
7141-			ashift = stbi__high_bit(ma) - 7;
7142-			acount = stbi__bitcount(ma);
7143-			if (rcount > 8 || gcount > 8 || bcount > 8 || acount > 8) {
7144-				STBI_FREE(out);
7145-				return stbi__errpuc("bad masks", "Corrupt BMP");
7146-			}
7147-		}
7148-		for (j = 0; j < (int)s->img_y; ++j) {
7149-			if (easy) {
7150-				for (i = 0; i < (int)s->img_x; ++i) {
7151-					unsigned char a;
7152-					out[z + 2] = stbi__get8(s);
7153-					out[z + 1] = stbi__get8(s);
7154-					out[z + 0] = stbi__get8(s);
7155-					z += 3;
7156-					a = (easy == 2 ? stbi__get8(s) : 255);
7157-					all_a |= a;
7158-					if (target == 4) {
7159-						out[z++] = a;
7160-					}
7161-				}
7162-			} else {
7163-				int bpp = info.bpp;
7164-				for (i = 0; i < (int)s->img_x; ++i) {
7165-					stbi__uint32 v = (bpp == 16 ? (stbi__uint32)stbi__get16le(s)
7166-					                            : stbi__get32le(s));
7167-					unsigned int a;
7168-					out[z++] = STBI__BYTECAST(
7169-					    stbi__shiftsigned(v & mr, rshift, rcount));
7170-					out[z++] = STBI__BYTECAST(
7171-					    stbi__shiftsigned(v & mg, gshift, gcount));
7172-					out[z++] = STBI__BYTECAST(
7173-					    stbi__shiftsigned(v & mb, bshift, bcount));
7174-					a = (ma ? stbi__shiftsigned(v & ma, ashift, acount) : 255);
7175-					all_a |= a;
7176-					if (target == 4) {
7177-						out[z++] = STBI__BYTECAST(a);
7178-					}
7179-				}
7180-			}
7181-			stbi__skip(s, pad);
7182-		}
7183-	}
7184-
7185-	// if alpha channel is all 0s, replace with all 255s
7186-	if (target == 4 && all_a == 0) {
7187-		for (i = 4 * s->img_x * s->img_y - 1; i >= 0; i -= 4) {
7188-			out[i] = 255;
7189-		}
7190-	}
7191-
7192-	if (flip_vertically) {
7193-		stbi_uc t;
7194-		for (j = 0; j < (int)s->img_y >> 1; ++j) {
7195-			stbi_uc *p1 = out + j * s->img_x * target;
7196-			stbi_uc *p2 = out + (s->img_y - 1 - j) * s->img_x * target;
7197-			for (i = 0; i < (int)s->img_x * target; ++i) {
7198-				t = p1[i];
7199-				p1[i] = p2[i];
7200-				p2[i] = t;
7201-			}
7202-		}
7203-	}
7204-
7205-	if (req_comp && req_comp != target) {
7206-		out = stbi__convert_format(out, target, req_comp, s->img_x, s->img_y);
7207-		if (out == NULL) {
7208-			return out; // stbi__convert_format frees input on failure
7209-		}
7210-	}
7211-
7212-	*x = s->img_x;
7213-	*y = s->img_y;
7214-	if (comp) {
7215-		*comp = s->img_n;
7216-	}
7217-	return out;
7218-}
7219-#endif
7220-
7221-// Targa Truevision - TGA
7222-// by Jonathan Dummer
7223-#ifndef STBI_NO_TGA
7224-// returns the number of components (STBI_grey..STBI_rgb_alpha), 0 on error
7225-static int
7226-stbi__tga_get_comp(int bits_per_pixel, int is_grey, int *is_rgb16)
7227-{
7228-	// only RGB or RGBA (incl. 16bit) or grey allowed
7229-	if (is_rgb16) {
7230-		*is_rgb16 = 0;
7231-	}
7232-	switch (bits_per_pixel) {
7233-	case 8:
7234-		return STBI_grey;
7235-	case 16:
7236-		if (is_grey) {
7237-			return STBI_grey_alpha;
7238-		}
7239-		// fallthrough
7240-	case 15:
7241-		if (is_rgb16) {
7242-			*is_rgb16 = 1;
7243-		}
7244-		return STBI_rgb;
7245-	case 24: // fallthrough
7246-	case 32:
7247-		return bits_per_pixel / 8;
7248-	default:
7249-		return 0;
7250-	}
7251-}
7252-
7253-static int
7254-stbi__tga_info(stbi__context *s, int *x, int *y, int *comp)
7255-{
7256-	int tga_w, tga_h, tga_comp, tga_image_type, tga_bits_per_pixel,
7257-	    tga_colormap_bpp;
7258-	int sz, tga_colormap_type;
7259-	stbi__get8(s);                     // discard Offset
7260-	tga_colormap_type = stbi__get8(s); // colormap type
7261-	if (tga_colormap_type > 1) {
7262-		stbi__rewind(s);
7263-		return 0; // only RGB or indexed allowed
7264-	}
7265-	tga_image_type = stbi__get8(s); // image type
7266-	if (tga_colormap_type == 1) {   // colormapped (paletted) image
7267-		if (tga_image_type != 1 && tga_image_type != 9) {
7268-			stbi__rewind(s);
7269-			return 0;
7270-		}
7271-		stbi__skip(
7272-		    s, 4); // skip index of first colormap entry and number of entries
7273-		sz = stbi__get8(s); //   check bits per palette color entry
7274-		if ((sz != 8) && (sz != 15) && (sz != 16) && (sz != 24) && (sz != 32)) {
7275-			stbi__rewind(s);
7276-			return 0;
7277-		}
7278-		stbi__skip(s, 4); // skip image x and y origin
7279-		tga_colormap_bpp = sz;
7280-	} else { // "normal" image w/o colormap - only RGB or grey allowed, +/- RLE
7281-		if ((tga_image_type != 2) && (tga_image_type != 3) &&
7282-		    (tga_image_type != 10) && (tga_image_type != 11)) {
7283-			stbi__rewind(s);
7284-			return 0; // only RGB or grey allowed, +/- RLE
7285-		}
7286-		stbi__skip(s, 9); // skip colormap specification and image x/y origin
7287-		tga_colormap_bpp = 0;
7288-	}
7289-	tga_w = stbi__get16le(s);
7290-	if (tga_w < 1) {
7291-		stbi__rewind(s);
7292-		return 0; // test width
7293-	}
7294-	tga_h = stbi__get16le(s);
7295-	if (tga_h < 1) {
7296-		stbi__rewind(s);
7297-		return 0; // test height
7298-	}
7299-	tga_bits_per_pixel = stbi__get8(s); // bits per pixel
7300-	stbi__get8(s);                      // ignore alpha bits
7301-	if (tga_colormap_bpp != 0) {
7302-		if ((tga_bits_per_pixel != 8) && (tga_bits_per_pixel != 16)) {
7303-			// when using a colormap, tga_bits_per_pixel is the size of the
7304-			// indexes; I don't think anything but 8 or 16bit indexes makes sense
7305-			stbi__rewind(s);
7306-			return 0;
7307-		}
7308-		tga_comp = stbi__tga_get_comp(tga_colormap_bpp, 0, NULL);
7309-	} else {
7310-		tga_comp = stbi__tga_get_comp(
7311-		    tga_bits_per_pixel, (tga_image_type == 3) || (tga_image_type == 11),
7312-		    NULL);
7313-	}
7314-	if (!tga_comp) {
7315-		stbi__rewind(s);
7316-		return 0;
7317-	}
7318-	if (x) {
7319-		*x = tga_w;
7320-	}
7321-	if (y) {
7322-		*y = tga_h;
7323-	}
7324-	if (comp) {
7325-		*comp = tga_comp;
7326-	}
7327-	return 1; // seems to have passed everything
7328-}
7329-
7330-static int
7331-stbi__tga_test(stbi__context *s)
7332-{
7333-	int res = 0;
7334-	int sz, tga_color_type;
7335-	stbi__get8(s);                  //   discard Offset
7336-	tga_color_type = stbi__get8(s); //   color type
7337-	if (tga_color_type > 1) {
7338-		goto errorEnd; //   only RGB or indexed allowed
7339-	}
7340-	sz = stbi__get8(s);        //   image type
7341-	if (tga_color_type == 1) { // colormapped (paletted) image
7342-		if (sz != 1 && sz != 9) {
7343-			goto errorEnd; // colortype 1 demands image type 1 or 9
7344-		}
7345-		stbi__skip(
7346-		    s, 4); // skip index of first colormap entry and number of entries
7347-		sz = stbi__get8(s); //   check bits per palette color entry
7348-		if ((sz != 8) && (sz != 15) && (sz != 16) && (sz != 24) && (sz != 32)) {
7349-			goto errorEnd;
7350-		}
7351-		stbi__skip(s, 4); // skip image x and y origin
7352-	} else {              // "normal" image w/o colormap
7353-		if ((sz != 2) && (sz != 3) && (sz != 10) && (sz != 11)) {
7354-			goto errorEnd; // only RGB or grey allowed, +/- RLE
7355-		}
7356-		stbi__skip(s, 9); // skip colormap specification and image x/y origin
7357-	}
7358-	if (stbi__get16le(s) < 1) {
7359-		goto errorEnd; //   test width
7360-	}
7361-	if (stbi__get16le(s) < 1) {
7362-		goto errorEnd; //   test height
7363-	}
7364-	sz = stbi__get8(s); //   bits per pixel
7365-	if ((tga_color_type == 1) && (sz != 8) && (sz != 16)) {
7366-		goto errorEnd; // for colormapped images, bpp is size of an index
7367-	}
7368-	if ((sz != 8) && (sz != 15) && (sz != 16) && (sz != 24) && (sz != 32)) {
7369-		goto errorEnd;
7370-	}
7371-
7372-	res = 1; // if we got this far, everything's good and we can return 1
7373-	         // instead of 0
7374-
7375-errorEnd:
7376-	stbi__rewind(s);
7377-	return res;
7378-}
7379-
7380-// read 16bit value and convert to 24bit RGB
7381-static void
7382-stbi__tga_read_rgb16(stbi__context *s, stbi_uc *out)
7383-{
7384-	stbi__uint16 px = (stbi__uint16)stbi__get16le(s);
7385-	stbi__uint16 fiveBitMask = 31;
7386-	// we have 3 channels with 5bits each
7387-	int r = (px >> 10) & fiveBitMask;
7388-	int g = (px >> 5) & fiveBitMask;
7389-	int b = px & fiveBitMask;
7390-	// Note that this saves the data in RGB(A) order, so it doesn't need to be
7391-	// swapped later
7392-	out[0] = (stbi_uc)((r * 255) / 31);
7393-	out[1] = (stbi_uc)((g * 255) / 31);
7394-	out[2] = (stbi_uc)((b * 255) / 31);
7395-
7396-	// some people claim that the most significant bit might be used for alpha
7397-	// (possibly if an alpha-bit is set in the "image descriptor byte")
7398-	// but that only made 16bit test images completely translucent..
7399-	// so let's treat all 15 and 16bit TGAs as RGB with no alpha.
7400-}
7401-
7402-static void *
7403-stbi__tga_load(stbi__context *s, int *x, int *y, int *comp, int req_comp,
7404-               stbi__result_info *ri)
7405-{
7406-	//   read in the TGA header stuff
7407-	int tga_offset = stbi__get8(s);
7408-	int tga_indexed = stbi__get8(s);
7409-	int tga_image_type = stbi__get8(s);
7410-	int tga_is_RLE = 0;
7411-	int tga_palette_start = stbi__get16le(s);
7412-	int tga_palette_len = stbi__get16le(s);
7413-	int tga_palette_bits = stbi__get8(s);
7414-	int tga_x_origin = stbi__get16le(s);
7415-	int tga_y_origin = stbi__get16le(s);
7416-	int tga_width = stbi__get16le(s);
7417-	int tga_height = stbi__get16le(s);
7418-	int tga_bits_per_pixel = stbi__get8(s);
7419-	int tga_comp, tga_rgb16 = 0;
7420-	int tga_inverted = stbi__get8(s);
7421-	// int tga_alpha_bits = tga_inverted & 15; // the 4 lowest bits - unused
7422-	// (useless?)
7423-	//   image data
7424-	unsigned char *tga_data;
7425-	unsigned char *tga_palette = NULL;
7426-	int i, j;
7427-	unsigned char raw_data[4] = {0};
7428-	int RLE_count = 0;
7429-	int RLE_repeating = 0;
7430-	int read_next_pixel = 1;
7431-	STBI_NOTUSED(ri);
7432-	STBI_NOTUSED(tga_x_origin); // @TODO
7433-	STBI_NOTUSED(tga_y_origin); // @TODO
7434-
7435-	if (tga_height > STBI_MAX_DIMENSIONS) {
7436-		return stbi__errpuc("too large", "Very large image (corrupt?)");
7437-	}
7438-	if (tga_width > STBI_MAX_DIMENSIONS) {
7439-		return stbi__errpuc("too large", "Very large image (corrupt?)");
7440-	}
7441-
7442-	//   do a tiny bit of processing
7443-	if (tga_image_type >= 8) {
7444-		tga_image_type -= 8;
7445-		tga_is_RLE = 1;
7446-	}
7447-	tga_inverted = 1 - ((tga_inverted >> 5) & 1);
7448-
7449-	//   If I'm paletted, then I'll use the number of bits from the palette
7450-	if (tga_indexed) {
7451-		tga_comp = stbi__tga_get_comp(tga_palette_bits, 0, &tga_rgb16);
7452-	} else {
7453-		tga_comp = stbi__tga_get_comp(tga_bits_per_pixel, (tga_image_type == 3),
7454-		                              &tga_rgb16);
7455-	}
7456-
7457-	if (!tga_comp) { // shouldn't really happen, stbi__tga_test() should have
7458-		             // ensured basic consistency
7459-		return stbi__errpuc("bad format", "Can't find out TGA pixelformat");
7460-	}
7461-
7462-	//   tga info
7463-	*x = tga_width;
7464-	*y = tga_height;
7465-	if (comp) {
7466-		*comp = tga_comp;
7467-	}
7468-
7469-	if (!stbi__mad3sizes_valid(tga_width, tga_height, tga_comp, 0)) {
7470-		return stbi__errpuc("too large", "Corrupt TGA");
7471-	}
7472-
7473-	tga_data =
7474-	    (unsigned char *)stbi__malloc_mad3(tga_width, tga_height, tga_comp, 0);
7475-	if (!tga_data) {
7476-		return stbi__errpuc("outofmem", "Out of memory");
7477-	}
7478-
7479-	// skip to the data's starting position (offset usually = 0)
7480-	stbi__skip(s, tga_offset);
7481-
7482-	if (!tga_indexed && !tga_is_RLE && !tga_rgb16) {
7483-		for (i = 0; i < tga_height; ++i) {
7484-			int row = tga_inverted ? tga_height - i - 1 : i;
7485-			stbi_uc *tga_row = tga_data + row * tga_width * tga_comp;
7486-			stbi__getn(s, tga_row, tga_width * tga_comp);
7487-		}
7488-	} else {
7489-		//   do I need to load a palette?
7490-		if (tga_indexed) {
7491-			if (tga_palette_len ==
7492-			    0) { /* you have to have at least one entry! */
7493-				STBI_FREE(tga_data);
7494-				return stbi__errpuc("bad palette", "Corrupt TGA");
7495-			}
7496-
7497-			//   any data to skip? (offset usually = 0)
7498-			stbi__skip(s, tga_palette_start);
7499-			//   load the palette
7500-			tga_palette = (unsigned char *)stbi__malloc_mad2(tga_palette_len,
7501-			                                                 tga_comp, 0);
7502-			if (!tga_palette) {
7503-				STBI_FREE(tga_data);
7504-				return stbi__errpuc("outofmem", "Out of memory");
7505-			}
7506-			if (tga_rgb16) {
7507-				stbi_uc *pal_entry = tga_palette;
7508-				STBI_ASSERT(tga_comp == STBI_rgb);
7509-				for (i = 0; i < tga_palette_len; ++i) {
7510-					stbi__tga_read_rgb16(s, pal_entry);
7511-					pal_entry += tga_comp;
7512-				}
7513-			} else if (!stbi__getn(s, tga_palette,
7514-			                       tga_palette_len * tga_comp)) {
7515-				STBI_FREE(tga_data);
7516-				STBI_FREE(tga_palette);
7517-				return stbi__errpuc("bad palette", "Corrupt TGA");
7518-			}
7519-		}
7520-		//   load the data
7521-		for (i = 0; i < tga_width * tga_height; ++i) {
7522-			//   if I'm in RLE mode, do I need to get a RLE chunk?
7523-			if (tga_is_RLE) {
7524-				if (RLE_count == 0) {
7525-					//   yep, get the next byte as a RLE command
7526-					int RLE_cmd = stbi__get8(s);
7527-					RLE_count = 1 + (RLE_cmd & 127);
7528-					RLE_repeating = RLE_cmd >> 7;
7529-					read_next_pixel = 1;
7530-				} else if (!RLE_repeating) {
7531-					read_next_pixel = 1;
7532-				}
7533-			} else {
7534-				read_next_pixel = 1;
7535-			}
7536-			//   OK, if I need to read a pixel, do it now
7537-			if (read_next_pixel) {
7538-				//   load however much data we did have
7539-				if (tga_indexed) {
7540-					// read in index, then perform the lookup
7541-					int pal_idx = (tga_bits_per_pixel == 8) ? stbi__get8(s)
7542-					                                        : stbi__get16le(s);
7543-					if (pal_idx >= tga_palette_len) {
7544-						// invalid index
7545-						pal_idx = 0;
7546-					}
7547-					pal_idx *= tga_comp;
7548-					for (j = 0; j < tga_comp; ++j) {
7549-						raw_data[j] = tga_palette[pal_idx + j];
7550-					}
7551-				} else if (tga_rgb16) {
7552-					STBI_ASSERT(tga_comp == STBI_rgb);
7553-					stbi__tga_read_rgb16(s, raw_data);
7554-				} else {
7555-					//   read in the data raw
7556-					for (j = 0; j < tga_comp; ++j) {
7557-						raw_data[j] = stbi__get8(s);
7558-					}
7559-				}
7560-				//   clear the reading flag for the next pixel
7561-				read_next_pixel = 0;
7562-			} // end of reading a pixel
7563-
7564-			// copy data
7565-			for (j = 0; j < tga_comp; ++j) {
7566-				tga_data[i * tga_comp + j] = raw_data[j];
7567-			}
7568-
7569-			//   in case we're in RLE mode, keep counting down
7570-			--RLE_count;
7571-		}
7572-		//   do I need to invert the image?
7573-		if (tga_inverted) {
7574-			for (j = 0; j * 2 < tga_height; ++j) {
7575-				int index1 = j * tga_width * tga_comp;
7576-				int index2 = (tga_height - 1 - j) * tga_width * tga_comp;
7577-				for (i = tga_width * tga_comp; i > 0; --i) {
7578-					unsigned char temp = tga_data[index1];
7579-					tga_data[index1] = tga_data[index2];
7580-					tga_data[index2] = temp;
7581-					++index1;
7582-					++index2;
7583-				}
7584-			}
7585-		}
7586-		//   clear my palette, if I had one
7587-		if (tga_palette != NULL) {
7588-			STBI_FREE(tga_palette);
7589-		}
7590-	}
7591-
7592-	// swap RGB - if the source data was RGB16, it already is in the right order
7593-	if (tga_comp >= 3 && !tga_rgb16) {
7594-		unsigned char *tga_pixel = tga_data;
7595-		for (i = 0; i < tga_width * tga_height; ++i) {
7596-			unsigned char temp = tga_pixel[0];
7597-			tga_pixel[0] = tga_pixel[2];
7598-			tga_pixel[2] = temp;
7599-			tga_pixel += tga_comp;
7600-		}
7601-	}
7602-
7603-	// convert to target component count
7604-	if (req_comp && req_comp != tga_comp) {
7605-		tga_data = stbi__convert_format(tga_data, tga_comp, req_comp, tga_width,
7606-		                                tga_height);
7607-	}
7608-
7609-	//   the things I do to get rid of an error message, and yet keep
7610-	//   Microsoft's C compilers happy... [8^(
7611-	tga_palette_start = tga_palette_len = tga_palette_bits = tga_x_origin =
7612-	    tga_y_origin = 0;
7613-	STBI_NOTUSED(tga_palette_start);
7614-	//   OK, done
7615-	return tga_data;
7616-}
7617-#endif
7618-
7619-// *************************************************************************************************
7620-// Photoshop PSD loader -- PD by Thatcher Ulrich, integration by Nicolas Schulz,
7621-// tweaked by STB
7622-
7623-#ifndef STBI_NO_PSD
7624-static int
7625-stbi__psd_test(stbi__context *s)
7626-{
7627-	int r = (stbi__get32be(s) == 0x38425053);
7628-	stbi__rewind(s);
7629-	return r;
7630-}
7631-
7632-static int
7633-stbi__psd_decode_rle(stbi__context *s, stbi_uc *p, int pixelCount)
7634-{
7635-	int count, nleft, len;
7636-
7637-	count = 0;
7638-	while ((nleft = pixelCount - count) > 0) {
7639-		len = stbi__get8(s);
7640-		if (len == 128) {
7641-			// No-op.
7642-		} else if (len < 128) {
7643-			// Copy next len+1 bytes literally.
7644-			len++;
7645-			if (len > nleft) {
7646-				return 0; // corrupt data
7647-			}
7648-			count += len;
7649-			while (len) {
7650-				*p = stbi__get8(s);
7651-				p += 4;
7652-				len--;
7653-			}
7654-		} else if (len > 128) {
7655-			stbi_uc val;
7656-			// Next -len+1 bytes in the dest are replicated from next source
7657-			// byte. (Interpret len as a negative 8-bit int.)
7658-			len = 257 - len;
7659-			if (len > nleft) {
7660-				return 0; // corrupt data
7661-			}
7662-			val = stbi__get8(s);
7663-			count += len;
7664-			while (len) {
7665-				*p = val;
7666-				p += 4;
7667-				len--;
7668-			}
7669-		}
7670-	}
7671-
7672-	return 1;
7673-}
7674-
7675-static void *
7676-stbi__psd_load(stbi__context *s, int *x, int *y, int *comp, int req_comp,
7677-               stbi__result_info *ri, int bpc)
7678-{
7679-	int pixelCount;
7680-	int channelCount, compression;
7681-	int channel, i;
7682-	int bitdepth;
7683-	int w, h;
7684-	stbi_uc *out;
7685-	STBI_NOTUSED(ri);
7686-
7687-	// Check identifier
7688-	if (stbi__get32be(s) != 0x38425053) { // "8BPS"
7689-		return stbi__errpuc("not PSD", "Corrupt PSD image");
7690-	}
7691-
7692-	// Check file type version.
7693-	if (stbi__get16be(s) != 1) {
7694-		return stbi__errpuc("wrong version",
7695-		                    "Unsupported version of PSD image");
7696-	}
7697-
7698-	// Skip 6 reserved bytes.
7699-	stbi__skip(s, 6);
7700-
7701-	// Read the number of channels (R, G, B, A, etc).
7702-	channelCount = stbi__get16be(s);
7703-	if (channelCount < 0 || channelCount > 16) {
7704-		return stbi__errpuc("wrong channel count",
7705-		                    "Unsupported number of channels in PSD image");
7706-	}
7707-
7708-	// Read the rows and columns of the image.
7709-	h = stbi__get32be(s);
7710-	w = stbi__get32be(s);
7711-
7712-	if (h > STBI_MAX_DIMENSIONS) {
7713-		return stbi__errpuc("too large", "Very large image (corrupt?)");
7714-	}
7715-	if (w > STBI_MAX_DIMENSIONS) {
7716-		return stbi__errpuc("too large", "Very large image (corrupt?)");
7717-	}
7718-
7719-	// Make sure the depth is 8 or 16 bits.
7720-	bitdepth = stbi__get16be(s);
7721-	if (bitdepth != 8 && bitdepth != 16) {
7722-		return stbi__errpuc("unsupported bit depth",
7723-		                    "PSD bit depth is not 8 or 16 bit");
7724-	}
7725-
7726-	// Make sure the color mode is RGB.
7727-	// Valid options are:
7728-	//   0: Bitmap
7729-	//   1: Grayscale
7730-	//   2: Indexed color
7731-	//   3: RGB color
7732-	//   4: CMYK color
7733-	//   7: Multichannel
7734-	//   8: Duotone
7735-	//   9: Lab color
7736-	if (stbi__get16be(s) != 3) {
7737-		return stbi__errpuc("wrong color format",
7738-		                    "PSD is not in RGB color format");
7739-	}
7740-
7741-	// Skip the Mode Data.  (It's the palette for indexed color; other info for
7742-	// other modes.)
7743-	stbi__skip(s, stbi__get32be(s));
7744-
7745-	// Skip the image resources.  (resolution, pen tool paths, etc)
7746-	stbi__skip(s, stbi__get32be(s));
7747-
7748-	// Skip the layer and mask information section.
7749-	stbi__skip(s, stbi__get32be(s));
7750-
7751-	// Find out if the data is compressed.
7752-	// Known values:
7753-	//   0: no compression
7754-	//   1: RLE compressed
7755-	compression = stbi__get16be(s);
7756-	if (compression > 1) {
7757-		return stbi__errpuc("bad compression",
7758-		                    "PSD has an unknown compression format");
7759-	}
7760-
7761-	// Check size
7762-	if (!stbi__mad3sizes_valid(4, w, h, 0)) {
7763-		return stbi__errpuc("too large", "Corrupt PSD");
7764-	}
7765-
7766-	// Create the destination image.
7767-
7768-	if (!compression && bitdepth == 16 && bpc == 16) {
7769-		out = (stbi_uc *)stbi__malloc_mad3(8, w, h, 0);
7770-		ri->bits_per_channel = 16;
7771-	} else {
7772-		out = (stbi_uc *)stbi__malloc(4 * w * h);
7773-	}
7774-
7775-	if (!out) {
7776-		return stbi__errpuc("outofmem", "Out of memory");
7777-	}
7778-	pixelCount = w * h;
7779-
7780-	// Initialize the data to zero.
7781-	// memset( out, 0, pixelCount * 4 );
7782-
7783-	// Finally, the image data.
7784-	if (compression) {
7785-		// RLE as used by .PSD and .TIFF
7786-		// Loop until you get the number of unpacked bytes you are expecting:
7787-		//     Read the next source byte into n.
7788-		//     If n is between 0 and 127 inclusive, copy the next n+1 bytes
7789-		//     literally. Else if n is between -127 and -1 inclusive, copy the
7790-		//     next byte -n+1 times. Else if n is -128, noop.
7791-		// Endloop
7792-
7793-		// The RLE-compressed data is preceded by a 2-byte data count for each
7794-		// row in the data, which we're going to just skip.
7795-		stbi__skip(s, h * channelCount * 2);
7796-
7797-		// Read the RLE data by channel.
7798-		for (channel = 0; channel < 4; channel++) {
7799-			stbi_uc *p;
7800-
7801-			p = out + channel;
7802-			if (channel >= channelCount) {
7803-				// Fill this channel with default data.
7804-				for (i = 0; i < pixelCount; i++, p += 4) {
7805-					*p = (channel == 3 ? 255 : 0);
7806-				}
7807-			} else {
7808-				// Read the RLE data.
7809-				if (!stbi__psd_decode_rle(s, p, pixelCount)) {
7810-					STBI_FREE(out);
7811-					return stbi__errpuc("corrupt", "bad RLE data");
7812-				}
7813-			}
7814-		}
7815-
7816-	} else {
7817-		// We're at the raw image data.  It's each channel in order (Red, Green,
7818-		// Blue, Alpha, ...) where each channel consists of an 8-bit (or 16-bit)
7819-		// value for each pixel in the image.
7820-
7821-		// Read the data by channel.
7822-		for (channel = 0; channel < 4; channel++) {
7823-			if (channel >= channelCount) {
7824-				// Fill this channel with default data.
7825-				if (bitdepth == 16 && bpc == 16) {
7826-					stbi__uint16 *q = ((stbi__uint16 *)out) + channel;
7827-					stbi__uint16 val = channel == 3 ? 65535 : 0;
7828-					for (i = 0; i < pixelCount; i++, q += 4) {
7829-						*q = val;
7830-					}
7831-				} else {
7832-					stbi_uc *p = out + channel;
7833-					stbi_uc val = channel == 3 ? 255 : 0;
7834-					for (i = 0; i < pixelCount; i++, p += 4) {
7835-						*p = val;
7836-					}
7837-				}
7838-			} else {
7839-				if (ri->bits_per_channel == 16) { // output bpc
7840-					stbi__uint16 *q = ((stbi__uint16 *)out) + channel;
7841-					for (i = 0; i < pixelCount; i++, q += 4) {
7842-						*q = (stbi__uint16)stbi__get16be(s);
7843-					}
7844-				} else {
7845-					stbi_uc *p = out + channel;
7846-					if (bitdepth == 16) { // input bpc
7847-						for (i = 0; i < pixelCount; i++, p += 4) {
7848-							*p = (stbi_uc)(stbi__get16be(s) >> 8);
7849-						}
7850-					} else {
7851-						for (i = 0; i < pixelCount; i++, p += 4) {
7852-							*p = stbi__get8(s);
7853-						}
7854-					}
7855-				}
7856-			}
7857-		}
7858-	}
7859-
7860-	// remove weird white matte from PSD
7861-	if (channelCount >= 4) {
7862-		if (ri->bits_per_channel == 16) {
7863-			for (i = 0; i < w * h; ++i) {
7864-				stbi__uint16 *pixel = (stbi__uint16 *)out + 4 * i;
7865-				if (pixel[3] != 0 && pixel[3] != 65535) {
7866-					float a = pixel[3] / 65535.0f;
7867-					float ra = 1.0f / a;
7868-					float inv_a = 65535.0f * (1 - ra);
7869-					pixel[0] = (stbi__uint16)(pixel[0] * ra + inv_a);
7870-					pixel[1] = (stbi__uint16)(pixel[1] * ra + inv_a);
7871-					pixel[2] = (stbi__uint16)(pixel[2] * ra + inv_a);
7872-				}
7873-			}
7874-		} else {
7875-			for (i = 0; i < w * h; ++i) {
7876-				unsigned char *pixel = out + 4 * i;
7877-				if (pixel[3] != 0 && pixel[3] != 255) {
7878-					float a = pixel[3] / 255.0f;
7879-					float ra = 1.0f / a;
7880-					float inv_a = 255.0f * (1 - ra);
7881-					pixel[0] = (unsigned char)(pixel[0] * ra + inv_a);
7882-					pixel[1] = (unsigned char)(pixel[1] * ra + inv_a);
7883-					pixel[2] = (unsigned char)(pixel[2] * ra + inv_a);
7884-				}
7885-			}
7886-		}
7887-	}
7888-
7889-	// convert to desired output format
7890-	if (req_comp && req_comp != 4) {
7891-		if (ri->bits_per_channel == 16) {
7892-			out = (stbi_uc *)stbi__convert_format16((stbi__uint16 *)out, 4,
7893-			                                        req_comp, w, h);
7894-		} else {
7895-			out = stbi__convert_format(out, 4, req_comp, w, h);
7896-		}
7897-		if (out == NULL) {
7898-			return out; // stbi__convert_format frees input on failure
7899-		}
7900-	}
7901-
7902-	if (comp) {
7903-		*comp = 4;
7904-	}
7905-	*y = h;
7906-	*x = w;
7907-
7908-	return out;
7909-}
7910-#endif
7911-
7912-// *************************************************************************************************
7913-// Softimage PIC loader
7914-// by Tom Seddon
7915-//
7916-// See http://softimage.wiki.softimage.com/index.php/INFO:_PIC_file_format
7917-// See http://ozviz.wasp.uwa.edu.au/~pbourke/dataformats/softimagepic/
7918-
7919-#ifndef STBI_NO_PIC
7920-static int
7921-stbi__pic_is4(stbi__context *s, const char *str)
7922-{
7923-	int i;
7924-	for (i = 0; i < 4; ++i) {
7925-		if (stbi__get8(s) != (stbi_uc)str[i]) {
7926-			return 0;
7927-		}
7928-	}
7929-
7930-	return 1;
7931-}
7932-
7933-static int
7934-stbi__pic_test_core(stbi__context *s)
7935-{
7936-	int i;
7937-
7938-	if (!stbi__pic_is4(s, "\x53\x80\xF6\x34")) {
7939-		return 0;
7940-	}
7941-
7942-	for (i = 0; i < 84; ++i) {
7943-		stbi__get8(s);
7944-	}
7945-
7946-	if (!stbi__pic_is4(s, "PICT")) {
7947-		return 0;
7948-	}
7949-
7950-	return 1;
7951-}
7952-
7953-typedef struct {
7954-	stbi_uc size, type, channel;
7955-} stbi__pic_packet;
7956-
7957-static stbi_uc *
7958-stbi__readval(stbi__context *s, int channel, stbi_uc *dest)
7959-{
7960-	int mask = 0x80, i;
7961-
7962-	for (i = 0; i < 4; ++i, mask >>= 1) {
7963-		if (channel & mask) {
7964-			if (stbi__at_eof(s)) {
7965-				return stbi__errpuc("bad file", "PIC file too short");
7966-			}
7967-			dest[i] = stbi__get8(s);
7968-		}
7969-	}
7970-
7971-	return dest;
7972-}
7973-
7974-static void
7975-stbi__copyval(int channel, stbi_uc *dest, const stbi_uc *src)
7976-{
7977-	int mask = 0x80, i;
7978-
7979-	for (i = 0; i < 4; ++i, mask >>= 1) {
7980-		if (channel & mask) {
7981-			dest[i] = src[i];
7982-		}
7983-	}
7984-}
7985-
7986-static stbi_uc *
7987-stbi__pic_load_core(stbi__context *s, int width, int height, int *comp,
7988-                    stbi_uc *result)
7989-{
7990-	int act_comp = 0, num_packets = 0, y, chained;
7991-	stbi__pic_packet packets[10];
7992-
7993-	// this will (should...) cater for even some bizarre stuff like having data
7994-	// for the same channel in multiple packets.
7995-	do {
7996-		stbi__pic_packet *packet;
7997-
7998-		if (num_packets == sizeof(packets) / sizeof(packets[0])) {
7999-			return stbi__errpuc("bad format", "too many packets");
8000-		}
8001-
8002-		packet = &packets[num_packets++];
8003-
8004-		chained = stbi__get8(s);
8005-		packet->size = stbi__get8(s);
8006-		packet->type = stbi__get8(s);
8007-		packet->channel = stbi__get8(s);
8008-
8009-		act_comp |= packet->channel;
8010-
8011-		if (stbi__at_eof(s)) {
8012-			return stbi__errpuc("bad file", "file too short (reading packets)");
8013-		}
8014-		if (packet->size != 8) {
8015-			return stbi__errpuc("bad format", "packet isn't 8bpp");
8016-		}
8017-	} while (chained);
8018-
8019-	*comp = (act_comp & 0x10 ? 4 : 3); // has alpha channel?
8020-
8021-	for (y = 0; y < height; ++y) {
8022-		int packet_idx;
8023-
8024-		for (packet_idx = 0; packet_idx < num_packets; ++packet_idx) {
8025-			stbi__pic_packet *packet = &packets[packet_idx];
8026-			stbi_uc *dest = result + y * width * 4;
8027-
8028-			switch (packet->type) {
8029-			default:
8030-				return stbi__errpuc("bad format",
8031-				                    "packet has bad compression type");
8032-
8033-			case 0: { // uncompressed
8034-				int x;
8035-
8036-				for (x = 0; x < width; ++x, dest += 4) {
8037-					if (!stbi__readval(s, packet->channel, dest)) {
8038-						return 0;
8039-					}
8040-				}
8041-				break;
8042-			}
8043-
8044-			case 1: // Pure RLE
8045-			{
8046-				int left = width, i;
8047-
8048-				while (left > 0) {
8049-					stbi_uc count, value[4];
8050-
8051-					count = stbi__get8(s);
8052-					if (stbi__at_eof(s)) {
8053-						return stbi__errpuc("bad file",
8054-						                    "file too short (pure read count)");
8055-					}
8056-
8057-					if (count > left) {
8058-						count = (stbi_uc)left;
8059-					}
8060-
8061-					if (!stbi__readval(s, packet->channel, value)) {
8062-						return 0;
8063-					}
8064-
8065-					for (i = 0; i < count; ++i, dest += 4) {
8066-						stbi__copyval(packet->channel, dest, value);
8067-					}
8068-					left -= count;
8069-				}
8070-			} break;
8071-
8072-			case 2: { // Mixed RLE
8073-				int left = width;
8074-				while (left > 0) {
8075-					int count = stbi__get8(s), i;
8076-					if (stbi__at_eof(s)) {
8077-						return stbi__errpuc(
8078-						    "bad file", "file too short (mixed read count)");
8079-					}
8080-
8081-					if (count >= 128) { // Repeated
8082-						stbi_uc value[4];
8083-
8084-						if (count == 128) {
8085-							count = stbi__get16be(s);
8086-						} else {
8087-							count -= 127;
8088-						}
8089-						if (count > left) {
8090-							return stbi__errpuc("bad file", "scanline overrun");
8091-						}
8092-
8093-						if (!stbi__readval(s, packet->channel, value)) {
8094-							return 0;
8095-						}
8096-
8097-						for (i = 0; i < count; ++i, dest += 4) {
8098-							stbi__copyval(packet->channel, dest, value);
8099-						}
8100-					} else { // Raw
8101-						++count;
8102-						if (count > left) {
8103-							return stbi__errpuc("bad file", "scanline overrun");
8104-						}
8105-
8106-						for (i = 0; i < count; ++i, dest += 4) {
8107-							if (!stbi__readval(s, packet->channel, dest)) {
8108-								return 0;
8109-							}
8110-						}
8111-					}
8112-					left -= count;
8113-				}
8114-				break;
8115-			}
8116-			}
8117-		}
8118-	}
8119-
8120-	return result;
8121-}
8122-
8123-static void *
8124-stbi__pic_load(stbi__context *s, int *px, int *py, int *comp, int req_comp,
8125-               stbi__result_info *ri)
8126-{
8127-	stbi_uc *result;
8128-	int i, x, y, internal_comp;
8129-	STBI_NOTUSED(ri);
8130-
8131-	if (!comp) {
8132-		comp = &internal_comp;
8133-	}
8134-
8135-	for (i = 0; i < 92; ++i) {
8136-		stbi__get8(s);
8137-	}
8138-
8139-	x = stbi__get16be(s);
8140-	y = stbi__get16be(s);
8141-
8142-	if (y > STBI_MAX_DIMENSIONS) {
8143-		return stbi__errpuc("too large", "Very large image (corrupt?)");
8144-	}
8145-	if (x > STBI_MAX_DIMENSIONS) {
8146-		return stbi__errpuc("too large", "Very large image (corrupt?)");
8147-	}
8148-
8149-	if (stbi__at_eof(s)) {
8150-		return stbi__errpuc("bad file", "file too short (pic header)");
8151-	}
8152-	if (!stbi__mad3sizes_valid(x, y, 4, 0)) {
8153-		return stbi__errpuc("too large", "PIC image too large to decode");
8154-	}
8155-
8156-	stbi__get32be(s); // skip `ratio'
8157-	stbi__get16be(s); // skip `fields'
8158-	stbi__get16be(s); // skip `pad'
8159-
8160-	// intermediate buffer is RGBA
8161-	result = (stbi_uc *)stbi__malloc_mad3(x, y, 4, 0);
8162-	if (!result) {
8163-		return stbi__errpuc("outofmem", "Out of memory");
8164-	}
8165-	memset(result, 0xff, x * y * 4);
8166-
8167-	if (!stbi__pic_load_core(s, x, y, comp, result)) {
8168-		STBI_FREE(result);
8169-		return 0;
8170-	}
8171-	*px = x;
8172-	*py = y;
8173-	if (req_comp == 0) {
8174-		req_comp = *comp;
8175-	}
8176-	result = stbi__convert_format(result, 4, req_comp, x, y);
8177-
8178-	return result;
8179-}
8180-
8181-static int
8182-stbi__pic_test(stbi__context *s)
8183-{
8184-	int r = stbi__pic_test_core(s);
8185-	stbi__rewind(s);
8186-	return r;
8187-}
8188-#endif
8189-
8190-// *************************************************************************************************
8191-// GIF loader -- public domain by Jean-Marc Lienher -- simplified/shrunk by stb
8192-
8193-#ifndef STBI_NO_GIF
8194-typedef struct {
8195-	stbi__int16 prefix;
8196-	stbi_uc first;
8197-	stbi_uc suffix;
8198-} stbi__gif_lzw;
8199-
8200-typedef struct {
8201-	int w, h;
8202-	stbi_uc *out; // output buffer (always 4 components)
8203-	stbi_uc
8204-	    *background; // The current "background" as far as a gif is concerned
8205-	stbi_uc *history;
8206-	int flags, bgindex, ratio, transparent, eflags;
8207-	stbi_uc pal[256][4];
8208-	stbi_uc lpal[256][4];
8209-	stbi__gif_lzw codes[8192];
8210-	stbi_uc *color_table;
8211-	int parse, step;
8212-	int lflags;
8213-	int start_x, start_y;
8214-	int max_x, max_y;
8215-	int cur_x, cur_y;
8216-	int line_size;
8217-	int delay;
8218-} stbi__gif;
8219-
8220-static int
8221-stbi__gif_test_raw(stbi__context *s)
8222-{
8223-	int sz;
8224-	if (stbi__get8(s) != 'G' || stbi__get8(s) != 'I' || stbi__get8(s) != 'F' ||
8225-	    stbi__get8(s) != '8') {
8226-		return 0;
8227-	}
8228-	sz = stbi__get8(s);
8229-	if (sz != '9' && sz != '7') {
8230-		return 0;
8231-	}
8232-	if (stbi__get8(s) != 'a') {
8233-		return 0;
8234-	}
8235-	return 1;
8236-}
8237-
8238-static int
8239-stbi__gif_test(stbi__context *s)
8240-{
8241-	int r = stbi__gif_test_raw(s);
8242-	stbi__rewind(s);
8243-	return r;
8244-}
8245-
8246-static void
8247-stbi__gif_parse_colortable(stbi__context *s, stbi_uc pal[256][4],
8248-                           int num_entries, int transp)
8249-{
8250-	int i;
8251-	for (i = 0; i < num_entries; ++i) {
8252-		pal[i][2] = stbi__get8(s);
8253-		pal[i][1] = stbi__get8(s);
8254-		pal[i][0] = stbi__get8(s);
8255-		pal[i][3] = transp == i ? 0 : 255;
8256-	}
8257-}
8258-
8259-static int
8260-stbi__gif_header(stbi__context *s, stbi__gif *g, int *comp, int is_info)
8261-{
8262-	stbi_uc version;
8263-	if (stbi__get8(s) != 'G' || stbi__get8(s) != 'I' || stbi__get8(s) != 'F' ||
8264-	    stbi__get8(s) != '8') {
8265-		return stbi__err("not GIF", "Corrupt GIF");
8266-	}
8267-
8268-	version = stbi__get8(s);
8269-	if (version != '7' && version != '9') {
8270-		return stbi__err("not GIF", "Corrupt GIF");
8271-	}
8272-	if (stbi__get8(s) != 'a') {
8273-		return stbi__err("not GIF", "Corrupt GIF");
8274-	}
8275-
8276-	stbi__g_failure_reason = "";
8277-	g->w = stbi__get16le(s);
8278-	g->h = stbi__get16le(s);
8279-	g->flags = stbi__get8(s);
8280-	g->bgindex = stbi__get8(s);
8281-	g->ratio = stbi__get8(s);
8282-	g->transparent = -1;
8283-
8284-	if (g->w > STBI_MAX_DIMENSIONS) {
8285-		return stbi__err("too large", "Very large image (corrupt?)");
8286-	}
8287-	if (g->h > STBI_MAX_DIMENSIONS) {
8288-		return stbi__err("too large", "Very large image (corrupt?)");
8289-	}
8290-
8291-	if (comp != 0) {
8292-		*comp = 4; // can't actually tell whether it's 3 or 4 until we parse the
8293-		           // comments
8294-	}
8295-
8296-	if (is_info) {
8297-		return 1;
8298-	}
8299-
8300-	if (g->flags & 0x80) {
8301-		stbi__gif_parse_colortable(s, g->pal, 2 << (g->flags & 7), -1);
8302-	}
8303-
8304-	return 1;
8305-}
8306-
8307-static int
8308-stbi__gif_info_raw(stbi__context *s, int *x, int *y, int *comp)
8309-{
8310-	stbi__gif *g = (stbi__gif *)stbi__malloc(sizeof(stbi__gif));
8311-	if (!g) {
8312-		return stbi__err("outofmem", "Out of memory");
8313-	}
8314-	if (!stbi__gif_header(s, g, comp, 1)) {
8315-		STBI_FREE(g);
8316-		stbi__rewind(s);
8317-		return 0;
8318-	}
8319-	if (x) {
8320-		*x = g->w;
8321-	}
8322-	if (y) {
8323-		*y = g->h;
8324-	}
8325-	STBI_FREE(g);
8326-	return 1;
8327-}
8328-
8329-static void
8330-stbi__out_gif_code(stbi__gif *g, stbi__uint16 code)
8331-{
8332-	stbi_uc *p, *c;
8333-	int idx;
8334-
8335-	// recurse to decode the prefixes, since the linked-list is backwards,
8336-	// and working backwards through an interleaved image would be nasty
8337-	if (g->codes[code].prefix >= 0) {
8338-		stbi__out_gif_code(g, g->codes[code].prefix);
8339-	}
8340-
8341-	if (g->cur_y >= g->max_y) {
8342-		return;
8343-	}
8344-
8345-	idx = g->cur_x + g->cur_y;
8346-	p = &g->out[idx];
8347-	g->history[idx / 4] = 1;
8348-
8349-	c = &g->color_table[g->codes[code].suffix * 4];
8350-	if (c[3] > 128) { // don't render transparent pixels;
8351-		p[0] = c[2];
8352-		p[1] = c[1];
8353-		p[2] = c[0];
8354-		p[3] = c[3];
8355-	}
8356-	g->cur_x += 4;
8357-
8358-	if (g->cur_x >= g->max_x) {
8359-		g->cur_x = g->start_x;
8360-		g->cur_y += g->step;
8361-
8362-		while (g->cur_y >= g->max_y && g->parse > 0) {
8363-			g->step = (1 << g->parse) * g->line_size;
8364-			g->cur_y = g->start_y + (g->step >> 1);
8365-			--g->parse;
8366-		}
8367-	}
8368-}
8369-
8370-static stbi_uc *
8371-stbi__process_gif_raster(stbi__context *s, stbi__gif *g)
8372-{
8373-	stbi_uc lzw_cs;
8374-	stbi__int32 len, init_code;
8375-	stbi__uint32 first;
8376-	stbi__int32 codesize, codemask, avail, oldcode, bits, valid_bits, clear;
8377-	stbi__gif_lzw *p;
8378-
8379-	lzw_cs = stbi__get8(s);
8380-	if (lzw_cs > 12) {
8381-		return NULL;
8382-	}
8383-	clear = 1 << lzw_cs;
8384-	first = 1;
8385-	codesize = lzw_cs + 1;
8386-	codemask = (1 << codesize) - 1;
8387-	bits = 0;
8388-	valid_bits = 0;
8389-	for (init_code = 0; init_code < clear; init_code++) {
8390-		g->codes[init_code].prefix = -1;
8391-		g->codes[init_code].first = (stbi_uc)init_code;
8392-		g->codes[init_code].suffix = (stbi_uc)init_code;
8393-	}
8394-
8395-	// support no starting clear code
8396-	avail = clear + 2;
8397-	oldcode = -1;
8398-
8399-	len = 0;
8400-	for (;;) {
8401-		if (valid_bits < codesize) {
8402-			if (len == 0) {
8403-				len = stbi__get8(s); // start new block
8404-				if (len == 0) {
8405-					return g->out;
8406-				}
8407-			}
8408-			--len;
8409-			bits |= (stbi__int32)stbi__get8(s) << valid_bits;
8410-			valid_bits += 8;
8411-		} else {
8412-			stbi__int32 code = bits & codemask;
8413-			bits >>= codesize;
8414-			valid_bits -= codesize;
8415-			// @OPTIMIZE: is there some way we can accelerate the non-clear
8416-			// path?
8417-			if (code == clear) { // clear code
8418-				codesize = lzw_cs + 1;
8419-				codemask = (1 << codesize) - 1;
8420-				avail = clear + 2;
8421-				oldcode = -1;
8422-				first = 0;
8423-			} else if (code == clear + 1) { // end of stream code
8424-				stbi__skip(s, len);
8425-				while ((len = stbi__get8(s)) > 0) {
8426-					stbi__skip(s, len);
8427-				}
8428-				return g->out;
8429-			} else if (code <= avail) {
8430-				if (first) {
8431-					return stbi__errpuc("no clear code", "Corrupt GIF");
8432-				}
8433-
8434-				if (oldcode >= 0) {
8435-					p = &g->codes[avail++];
8436-					if (avail > 8192) {
8437-						return stbi__errpuc("too many codes", "Corrupt GIF");
8438-					}
8439-
8440-					p->prefix = (stbi__int16)oldcode;
8441-					p->first = g->codes[oldcode].first;
8442-					p->suffix =
8443-					    (code == avail) ? p->first : g->codes[code].first;
8444-				} else if (code == avail) {
8445-					return stbi__errpuc("illegal code in raster",
8446-					                    "Corrupt GIF");
8447-				}
8448-
8449-				stbi__out_gif_code(g, (stbi__uint16)code);
8450-
8451-				if ((avail & codemask) == 0 && avail <= 0x0FFF) {
8452-					codesize++;
8453-					codemask = (1 << codesize) - 1;
8454-				}
8455-
8456-				oldcode = code;
8457-			} else {
8458-				return stbi__errpuc("illegal code in raster", "Corrupt GIF");
8459-			}
8460-		}
8461-	}
8462-}
8463-
8464-// this function is designed to support animated gifs, although stb_image
8465-// doesn't support it. two_back is the image from two frames ago, used for a
8466-// very specific disposal format
8467-static stbi_uc *
8468-stbi__gif_load_next(stbi__context *s, stbi__gif *g, int *comp, int req_comp,
8469-                    stbi_uc *two_back)
8470-{
8471-	int dispose;
8472-	int first_frame;
8473-	int pi;
8474-	int pcount;
8475-	STBI_NOTUSED(req_comp);
8476-
8477-	// on first frame, any non-written pixels get the background colour
8478-	// (non-transparent)
8479-	first_frame = 0;
8480-	if (g->out == 0) {
8481-		if (!stbi__gif_header(s, g, comp, 0)) {
8482-			return 0; // stbi__g_failure_reason set by stbi__gif_header
8483-		}
8484-		if (!stbi__mad3sizes_valid(4, g->w, g->h, 0)) {
8485-			return stbi__errpuc("too large", "GIF image is too large");
8486-		}
8487-		pcount = g->w * g->h;
8488-		g->out = (stbi_uc *)stbi__malloc(4 * pcount);
8489-		g->background = (stbi_uc *)stbi__malloc(4 * pcount);
8490-		g->history = (stbi_uc *)stbi__malloc(pcount);
8491-		if (!g->out || !g->background || !g->history) {
8492-			return stbi__errpuc("outofmem", "Out of memory");
8493-		}
8494-
8495-		// image is treated as "transparent" at the start - ie, nothing
8496-		// overwrites the current background; background colour is only used for
8497-		// pixels that are not rendered first frame, after that "background"
8498-		// color refers to the color that was there the previous frame.
8499-		memset(g->out, 0x00, 4 * pcount);
8500-		memset(g->background, 0x00,
8501-		       4 * pcount); // state of the background (starts transparent)
8502-		memset(g->history, 0x00,
8503-		       pcount); // pixels that were affected previous frame
8504-		first_frame = 1;
8505-	} else {
8506-		// second frame - how do we dispose of the previous one?
8507-		dispose = (g->eflags & 0x1C) >> 2;
8508-		pcount = g->w * g->h;
8509-
8510-		if ((dispose == 3) && (two_back == 0)) {
8511-			dispose = 2; // if I don't have an image to revert back to, default
8512-			             // to the old background
8513-		}
8514-
8515-		if (dispose == 3) { // use previous graphic
8516-			for (pi = 0; pi < pcount; ++pi) {
8517-				if (g->history[pi]) {
8518-					memcpy(&g->out[pi * 4], &two_back[pi * 4], 4);
8519-				}
8520-			}
8521-		} else if (dispose == 2) {
8522-			// restore what was changed last frame to background before that
8523-			// frame;
8524-			for (pi = 0; pi < pcount; ++pi) {
8525-				if (g->history[pi]) {
8526-					memcpy(&g->out[pi * 4], &g->background[pi * 4], 4);
8527-				}
8528-			}
8529-		} else {
8530-			// This is a non-disposal case either way, so just
8531-			// leave the pixels as is, and they will become the new background
8532-			// 1: do not dispose
8533-			// 0:  not specified.
8534-		}
8535-
8536-		// background is what out is after the undoing of the previous frame;
8537-		memcpy(g->background, g->out, 4 * g->w * g->h);
8538-	}
8539-
8540-	// clear my history;
8541-	memset(g->history, 0x00,
8542-	       g->w * g->h); // pixels that were affected previous frame
8543-
8544-	for (;;) {
8545-		int tag = stbi__get8(s);
8546-		switch (tag) {
8547-		case 0x2C: /* Image Descriptor */
8548-		{
8549-			stbi__int32 x, y, w, h;
8550-			stbi_uc *o;
8551-
8552-			x = stbi__get16le(s);
8553-			y = stbi__get16le(s);
8554-			w = stbi__get16le(s);
8555-			h = stbi__get16le(s);
8556-			if (((x + w) > (g->w)) || ((y + h) > (g->h))) {
8557-				return stbi__errpuc("bad Image Descriptor", "Corrupt GIF");
8558-			}
8559-
8560-			g->line_size = g->w * 4;
8561-			g->start_x = x * 4;
8562-			g->start_y = y * g->line_size;
8563-			g->max_x = g->start_x + w * 4;
8564-			g->max_y = g->start_y + h * g->line_size;
8565-			g->cur_x = g->start_x;
8566-			g->cur_y = g->start_y;
8567-
8568-			// if the width of the specified rectangle is 0, that means
8569-			// we may not see *any* pixels or the image is malformed;
8570-			// to make sure this is caught, move the current y down to
8571-			// max_y (which is what out_gif_code checks).
8572-			if (w == 0) {
8573-				g->cur_y = g->max_y;
8574-			}
8575-
8576-			g->lflags = stbi__get8(s);
8577-
8578-			if (g->lflags & 0x40) {
8579-				g->step = 8 * g->line_size; // first interlaced spacing
8580-				g->parse = 3;
8581-			} else {
8582-				g->step = g->line_size;
8583-				g->parse = 0;
8584-			}
8585-
8586-			if (g->lflags & 0x80) {
8587-				stbi__gif_parse_colortable(s, g->lpal, 2 << (g->lflags & 7),
8588-				                           g->eflags & 0x01 ? g->transparent
8589-				                                            : -1);
8590-				g->color_table = (stbi_uc *)g->lpal;
8591-			} else if (g->flags & 0x80) {
8592-				g->color_table = (stbi_uc *)g->pal;
8593-			} else {
8594-				return stbi__errpuc("missing color table", "Corrupt GIF");
8595-			}
8596-
8597-			o = stbi__process_gif_raster(s, g);
8598-			if (!o) {
8599-				return NULL;
8600-			}
8601-
8602-			// if this was the first frame,
8603-			pcount = g->w * g->h;
8604-			if (first_frame && (g->bgindex > 0)) {
8605-				// if first frame, any pixel not drawn to gets the background
8606-				// color
8607-				for (pi = 0; pi < pcount; ++pi) {
8608-					if (g->history[pi] == 0) {
8609-						g->pal[g->bgindex][3] =
8610-						    255; // just in case it was made transparent, undo
8611-						         // that; It will be reset next frame if need
8612-						         // be;
8613-						memcpy(&g->out[pi * 4], &g->pal[g->bgindex], 4);
8614-					}
8615-				}
8616-			}
8617-
8618-			return o;
8619-		}
8620-
8621-		case 0x21: // Extension Introducer.
8622-		{
8623-			int len;
8624-			int ext = stbi__get8(s);
8625-			if (ext == 0xF9) { // Graphic Control Extension.
8626-				len = stbi__get8(s);
8627-				if (len == 4) {
8628-					g->eflags = stbi__get8(s);
8629-					g->delay =
8630-					    10 * stbi__get16le(s); // delay - 1/100th of a second,
8631-					                           // saving as 1/1000ths.
8632-
8633-					// unset old transparent
8634-					if (g->transparent >= 0) {
8635-						g->pal[g->transparent][3] = 255;
8636-					}
8637-					if (g->eflags & 0x01) {
8638-						g->transparent = stbi__get8(s);
8639-						if (g->transparent >= 0) {
8640-							g->pal[g->transparent][3] = 0;
8641-						}
8642-					} else {
8643-						// don't need transparent
8644-						stbi__skip(s, 1);
8645-						g->transparent = -1;
8646-					}
8647-				} else {
8648-					stbi__skip(s, len);
8649-					break;
8650-				}
8651-			}
8652-			while ((len = stbi__get8(s)) != 0) {
8653-				stbi__skip(s, len);
8654-			}
8655-			break;
8656-		}
8657-
8658-		case 0x3B:               // gif stream termination code
8659-			return (stbi_uc *)s; // using '1' causes warning on some compilers
8660-
8661-		default:
8662-			return stbi__errpuc("unknown code", "Corrupt GIF");
8663-		}
8664-	}
8665-}
8666-
8667-static void *
8668-stbi__load_gif_main_outofmem(stbi__gif *g, stbi_uc *out, int **delays)
8669-{
8670-	STBI_FREE(g->out);
8671-	STBI_FREE(g->history);
8672-	STBI_FREE(g->background);
8673-
8674-	if (out) {
8675-		STBI_FREE(out);
8676-	}
8677-	if (delays && *delays) {
8678-		STBI_FREE(*delays);
8679-	}
8680-	return stbi__errpuc("outofmem", "Out of memory");
8681-}
8682-
8683-static void *
8684-stbi__load_gif_main(stbi__context *s, int **delays, int *x, int *y, int *z,
8685-                    int *comp, int req_comp)
8686-{
8687-	if (stbi__gif_test(s)) {
8688-		int layers = 0;
8689-		stbi_uc *u = 0;
8690-		stbi_uc *out = 0;
8691-		stbi_uc *two_back = 0;
8692-		stbi__gif g;
8693-		int stride;
8694-		int out_size = 0;
8695-		int delays_size = 0;
8696-
8697-		STBI_NOTUSED(out_size);
8698-		STBI_NOTUSED(delays_size);
8699-
8700-		memset(&g, 0, sizeof(g));
8701-		if (delays) {
8702-			*delays = 0;
8703-		}
8704-
8705-		do {
8706-			u = stbi__gif_load_next(s, &g, comp, req_comp, two_back);
8707-			if (u == (stbi_uc *)s) {
8708-				u = 0; // end of animated gif marker
8709-			}
8710-
8711-			if (u) {
8712-				*x = g.w;
8713-				*y = g.h;
8714-				++layers;
8715-				stride = g.w * g.h * 4;
8716-
8717-				if (out) {
8718-					void *tmp = (stbi_uc *)STBI_REALLOC_SIZED(out, out_size,
8719-					                                          layers * stride);
8720-					if (!tmp) {
8721-						return stbi__load_gif_main_outofmem(&g, out, delays);
8722-					} else {
8723-						out = (stbi_uc *)tmp;
8724-						out_size = layers * stride;
8725-					}
8726-
8727-					if (delays) {
8728-						int *new_delays = (int *)STBI_REALLOC_SIZED(
8729-						    *delays, delays_size, sizeof(int) * layers);
8730-						if (!new_delays) {
8731-							return stbi__load_gif_main_outofmem(&g, out,
8732-							                                    delays);
8733-						}
8734-						*delays = new_delays;
8735-						delays_size = layers * sizeof(int);
8736-					}
8737-				} else {
8738-					out = (stbi_uc *)stbi__malloc(layers * stride);
8739-					if (!out) {
8740-						return stbi__load_gif_main_outofmem(&g, out, delays);
8741-					}
8742-					out_size = layers * stride;
8743-					if (delays) {
8744-						*delays = (int *)stbi__malloc(layers * sizeof(int));
8745-						if (!*delays) {
8746-							return stbi__load_gif_main_outofmem(&g, out,
8747-							                                    delays);
8748-						}
8749-						delays_size = layers * sizeof(int);
8750-					}
8751-				}
8752-				memcpy(out + ((layers - 1) * stride), u, stride);
8753-				if (layers >= 2) {
8754-					two_back = out + (layers - 2) * stride;
8755-				}
8756-
8757-				if (delays) {
8758-					(*delays)[layers - 1U] = g.delay;
8759-				}
8760-			}
8761-		} while (u != 0);
8762-
8763-		// free temp buffer;
8764-		STBI_FREE(g.out);
8765-		STBI_FREE(g.history);
8766-		STBI_FREE(g.background);
8767-
8768-		// do the final conversion after loading everything;
8769-		if (req_comp && req_comp != 4) {
8770-			out = stbi__convert_format(out, 4, req_comp, layers * g.w, g.h);
8771-		}
8772-
8773-		*z = layers;
8774-		return out;
8775-	} else {
8776-		return stbi__errpuc("not GIF", "Image was not a GIF.");
8777-	}
8778-}
8779-
8780-static void *
8781-stbi__gif_load(stbi__context *s, int *x, int *y, int *comp, int req_comp,
8782-               stbi__result_info *ri)
8783-{
8784-	stbi_uc *u = 0;
8785-	stbi__gif g;
8786-	memset(&g, 0, sizeof(g));
8787-	STBI_NOTUSED(ri);
8788-
8789-	u = stbi__gif_load_next(s, &g, comp, req_comp, 0);
8790-	if (u == (stbi_uc *)s) {
8791-		u = 0; // end of animated gif marker
8792-	}
8793-	if (u) {
8794-		*x = g.w;
8795-		*y = g.h;
8796-
8797-		// moved conversion to after successful load so that the same
8798-		// can be done for multiple frames.
8799-		if (req_comp && req_comp != 4) {
8800-			u = stbi__convert_format(u, 4, req_comp, g.w, g.h);
8801-		}
8802-	} else if (g.out) {
8803-		// if there was an error and we allocated an image buffer, free it!
8804-		STBI_FREE(g.out);
8805-	}
8806-
8807-	// free buffers needed for multiple frame loading;
8808-	STBI_FREE(g.history);
8809-	STBI_FREE(g.background);
8810-
8811-	return u;
8812-}
8813-
8814-static int
8815-stbi__gif_info(stbi__context *s, int *x, int *y, int *comp)
8816-{
8817-	return stbi__gif_info_raw(s, x, y, comp);
8818-}
8819-#endif
8820-
8821-// *************************************************************************************************
8822-// Radiance RGBE HDR loader
8823-// originally by Nicolas Schulz
8824-#ifndef STBI_NO_HDR
8825-static int
8826-stbi__hdr_test_core(stbi__context *s, const char *signature)
8827-{
8828-	int i;
8829-	for (i = 0; signature[i]; ++i) {
8830-		if (stbi__get8(s) != signature[i]) {
8831-			return 0;
8832-		}
8833-	}
8834-	stbi__rewind(s);
8835-	return 1;
8836-}
8837-
8838-static int
8839-stbi__hdr_test(stbi__context *s)
8840-{
8841-	int r = stbi__hdr_test_core(s, "#?RADIANCE\n");
8842-	stbi__rewind(s);
8843-	if (!r) {
8844-		r = stbi__hdr_test_core(s, "#?RGBE\n");
8845-		stbi__rewind(s);
8846-	}
8847-	return r;
8848-}
8849-
8850-#define STBI__HDR_BUFLEN 1024
8851-static char *
8852-stbi__hdr_gettoken(stbi__context *z, char *buffer)
8853-{
8854-	int len = 0;
8855-	char c = '\0';
8856-
8857-	c = (char)stbi__get8(z);
8858-
8859-	while (!stbi__at_eof(z) && c != '\n') {
8860-		buffer[len++] = c;
8861-		if (len == STBI__HDR_BUFLEN - 1) {
8862-			// flush to end of line
8863-			while (!stbi__at_eof(z) && stbi__get8(z) != '\n')
8864-				;
8865-			break;
8866-		}
8867-		c = (char)stbi__get8(z);
8868-	}
8869-
8870-	buffer[len] = 0;
8871-	return buffer;
8872-}
8873-
8874-static void
8875-stbi__hdr_convert(float *output, stbi_uc *input, int req_comp)
8876-{
8877-	if (input[3] != 0) {
8878-		float f1;
8879-		// Exponent
8880-		f1 = (float)ldexp(1.0f, input[3] - (int)(128 + 8));
8881-		if (req_comp <= 2) {
8882-			output[0] = (input[0] + input[1] + input[2]) * f1 / 3;
8883-		} else {
8884-			output[0] = input[0] * f1;
8885-			output[1] = input[1] * f1;
8886-			output[2] = input[2] * f1;
8887-		}
8888-		if (req_comp == 2) {
8889-			output[1] = 1;
8890-		}
8891-		if (req_comp == 4) {
8892-			output[3] = 1;
8893-		}
8894-	} else {
8895-		switch (req_comp) {
8896-		case 4:
8897-			output[3] = 1; /* fallthrough */
8898-		case 3:
8899-			output[0] = output[1] = output[2] = 0;
8900-			break;
8901-		case 2:
8902-			output[1] = 1; /* fallthrough */
8903-		case 1:
8904-			output[0] = 0;
8905-			break;
8906-		}
8907-	}
8908-}
8909-
8910-static float *
8911-stbi__hdr_load(stbi__context *s, int *x, int *y, int *comp, int req_comp,
8912-               stbi__result_info *ri)
8913-{
8914-	char buffer[STBI__HDR_BUFLEN];
8915-	char *token;
8916-	int valid = 0;
8917-	int width, height;
8918-	stbi_uc *scanline;
8919-	float *hdr_data;
8920-	int len;
8921-	unsigned char count, value;
8922-	int i, j, k, c1, c2, z;
8923-	const char *headerToken;
8924-	STBI_NOTUSED(ri);
8925-
8926-	// Check identifier
8927-	headerToken = stbi__hdr_gettoken(s, buffer);
8928-	if (strcmp(headerToken, "#?RADIANCE") != 0 &&
8929-	    strcmp(headerToken, "#?RGBE") != 0) {
8930-		return stbi__errpf("not HDR", "Corrupt HDR image");
8931-	}
8932-
8933-	// Parse header
8934-	for (;;) {
8935-		token = stbi__hdr_gettoken(s, buffer);
8936-		if (token[0] == 0) {
8937-			break;
8938-		}
8939-		if (strcmp(token, "FORMAT=32-bit_rle_rgbe") == 0) {
8940-			valid = 1;
8941-		}
8942-	}
8943-
8944-	if (!valid) {
8945-		return stbi__errpf("unsupported format", "Unsupported HDR format");
8946-	}
8947-
8948-	// Parse width and height
8949-	// can't use sscanf() if we're not using stdio!
8950-	token = stbi__hdr_gettoken(s, buffer);
8951-	if (strncmp(token, "-Y ", 3)) {
8952-		return stbi__errpf("unsupported data layout", "Unsupported HDR format");
8953-	}
8954-	token += 3;
8955-	height = (int)strtol(token, &token, 10);
8956-	while (*token == ' ') {
8957-		++token;
8958-	}
8959-	if (strncmp(token, "+X ", 3)) {
8960-		return stbi__errpf("unsupported data layout", "Unsupported HDR format");
8961-	}
8962-	token += 3;
8963-	width = (int)strtol(token, NULL, 10);
8964-
8965-	if (height > STBI_MAX_DIMENSIONS) {
8966-		return stbi__errpf("too large", "Very large image (corrupt?)");
8967-	}
8968-	if (width > STBI_MAX_DIMENSIONS) {
8969-		return stbi__errpf("too large", "Very large image (corrupt?)");
8970-	}
8971-
8972-	*x = width;
8973-	*y = height;
8974-
8975-	if (comp) {
8976-		*comp = 3;
8977-	}
8978-	if (req_comp == 0) {
8979-		req_comp = 3;
8980-	}
8981-
8982-	if (!stbi__mad4sizes_valid(width, height, req_comp, sizeof(float), 0)) {
8983-		return stbi__errpf("too large", "HDR image is too large");
8984-	}
8985-
8986-	// Read data
8987-	hdr_data =
8988-	    (float *)stbi__malloc_mad4(width, height, req_comp, sizeof(float), 0);
8989-	if (!hdr_data) {
8990-		return stbi__errpf("outofmem", "Out of memory");
8991-	}
8992-
8993-	// Load image data
8994-	// image data is stored as some number of scanlines
8995-	if (width < 8 || width >= 32768) {
8996-		// Read flat data
8997-		for (j = 0; j < height; ++j) {
8998-			for (i = 0; i < width; ++i) {
8999-				stbi_uc rgbe[4];
9000-			main_decode_loop:
9001-				stbi__getn(s, rgbe, 4);
9002-				stbi__hdr_convert(hdr_data + j * width * req_comp +
9003-				                      i * req_comp,
9004-				                  rgbe, req_comp);
9005-			}
9006-		}
9007-	} else {
9008-		// Read RLE-encoded data
9009-		scanline = NULL;
9010-
9011-		for (j = 0; j < height; ++j) {
9012-			c1 = stbi__get8(s);
9013-			c2 = stbi__get8(s);
9014-			len = stbi__get8(s);
9015-			if (c1 != 2 || c2 != 2 || (len & 0x80)) {
9016-				// not run-length encoded, so we have to actually use THIS data
9017-				// as a decoded pixel (note this can't be a valid pixel--one of
9018-				// RGB must be >= 128)
9019-				stbi_uc rgbe[4];
9020-				rgbe[0] = (stbi_uc)c1;
9021-				rgbe[1] = (stbi_uc)c2;
9022-				rgbe[2] = (stbi_uc)len;
9023-				rgbe[3] = (stbi_uc)stbi__get8(s);
9024-				stbi__hdr_convert(hdr_data, rgbe, req_comp);
9025-				i = 1;
9026-				j = 0;
9027-				STBI_FREE(scanline);
9028-				goto main_decode_loop; // yes, this makes no sense
9029-			}
9030-			len <<= 8;
9031-			len |= stbi__get8(s);
9032-			if (len != width) {
9033-				STBI_FREE(hdr_data);
9034-				STBI_FREE(scanline);
9035-				return stbi__errpf("invalid decoded scanline length",
9036-				                   "corrupt HDR");
9037-			}
9038-			if (scanline == NULL) {
9039-				scanline = (stbi_uc *)stbi__malloc_mad2(width, 4, 0);
9040-				if (!scanline) {
9041-					STBI_FREE(hdr_data);
9042-					return stbi__errpf("outofmem", "Out of memory");
9043-				}
9044-			}
9045-
9046-			for (k = 0; k < 4; ++k) {
9047-				int nleft;
9048-				i = 0;
9049-				while ((nleft = width - i) > 0) {
9050-					count = stbi__get8(s);
9051-					if (count > 128) {
9052-						// Run
9053-						value = stbi__get8(s);
9054-						count -= 128;
9055-						if ((count == 0) || (count > nleft)) {
9056-							STBI_FREE(hdr_data);
9057-							STBI_FREE(scanline);
9058-							return stbi__errpf("corrupt",
9059-							                   "bad RLE data in HDR");
9060-						}
9061-						for (z = 0; z < count; ++z) {
9062-							scanline[i++ * 4 + k] = value;
9063-						}
9064-					} else {
9065-						// Dump
9066-						if ((count == 0) || (count > nleft)) {
9067-							STBI_FREE(hdr_data);
9068-							STBI_FREE(scanline);
9069-							return stbi__errpf("corrupt",
9070-							                   "bad RLE data in HDR");
9071-						}
9072-						for (z = 0; z < count; ++z) {
9073-							scanline[i++ * 4 + k] = stbi__get8(s);
9074-						}
9075-					}
9076-				}
9077-			}
9078-			for (i = 0; i < width; ++i) {
9079-				stbi__hdr_convert(hdr_data + (j * width + i) * req_comp,
9080-				                  scanline + i * 4, req_comp);
9081-			}
9082-		}
9083-		if (scanline) {
9084-			STBI_FREE(scanline);
9085-		}
9086-	}
9087-
9088-	return hdr_data;
9089-}
9090-
9091-static int
9092-stbi__hdr_info(stbi__context *s, int *x, int *y, int *comp)
9093-{
9094-	char buffer[STBI__HDR_BUFLEN];
9095-	char *token;
9096-	int valid = 0;
9097-	int dummy;
9098-
9099-	if (!x) {
9100-		x = &dummy;
9101-	}
9102-	if (!y) {
9103-		y = &dummy;
9104-	}
9105-	if (!comp) {
9106-		comp = &dummy;
9107-	}
9108-
9109-	if (stbi__hdr_test(s) == 0) {
9110-		stbi__rewind(s);
9111-		return 0;
9112-	}
9113-
9114-	for (;;) {
9115-		token = stbi__hdr_gettoken(s, buffer);
9116-		if (token[0] == 0) {
9117-			break;
9118-		}
9119-		if (strcmp(token, "FORMAT=32-bit_rle_rgbe") == 0) {
9120-			valid = 1;
9121-		}
9122-	}
9123-
9124-	if (!valid) {
9125-		stbi__rewind(s);
9126-		return 0;
9127-	}
9128-	token = stbi__hdr_gettoken(s, buffer);
9129-	if (strncmp(token, "-Y ", 3)) {
9130-		stbi__rewind(s);
9131-		return 0;
9132-	}
9133-	token += 3;
9134-	*y = (int)strtol(token, &token, 10);
9135-	while (*token == ' ') {
9136-		++token;
9137-	}
9138-	if (strncmp(token, "+X ", 3)) {
9139-		stbi__rewind(s);
9140-		return 0;
9141-	}
9142-	token += 3;
9143-	*x = (int)strtol(token, NULL, 10);
9144-	*comp = 3;
9145-	return 1;
9146-}
9147-#endif // STBI_NO_HDR
9148-
9149-#ifndef STBI_NO_BMP
9150-static int
9151-stbi__bmp_info(stbi__context *s, int *x, int *y, int *comp)
9152-{
9153-	void *p;
9154-	stbi__bmp_data info;
9155-
9156-	info.all_a = 255;
9157-	p = stbi__bmp_parse_header(s, &info);
9158-	if (p == NULL) {
9159-		stbi__rewind(s);
9160-		return 0;
9161-	}
9162-	if (x) {
9163-		*x = s->img_x;
9164-	}
9165-	if (y) {
9166-		*y = s->img_y;
9167-	}
9168-	if (comp) {
9169-		if (info.bpp == 24 && info.ma == 0xff000000) {
9170-			*comp = 3;
9171-		} else {
9172-			*comp = info.ma ? 4 : 3;
9173-		}
9174-	}
9175-	return 1;
9176-}
9177-#endif
9178-
9179-#ifndef STBI_NO_PSD
9180-static int
9181-stbi__psd_info(stbi__context *s, int *x, int *y, int *comp)
9182-{
9183-	int channelCount, dummy, depth;
9184-	if (!x) {
9185-		x = &dummy;
9186-	}
9187-	if (!y) {
9188-		y = &dummy;
9189-	}
9190-	if (!comp) {
9191-		comp = &dummy;
9192-	}
9193-	if (stbi__get32be(s) != 0x38425053) {
9194-		stbi__rewind(s);
9195-		return 0;
9196-	}
9197-	if (stbi__get16be(s) != 1) {
9198-		stbi__rewind(s);
9199-		return 0;
9200-	}
9201-	stbi__skip(s, 6);
9202-	channelCount = stbi__get16be(s);
9203-	if (channelCount < 0 || channelCount > 16) {
9204-		stbi__rewind(s);
9205-		return 0;
9206-	}
9207-	*y = stbi__get32be(s);
9208-	*x = stbi__get32be(s);
9209-	depth = stbi__get16be(s);
9210-	if (depth != 8 && depth != 16) {
9211-		stbi__rewind(s);
9212-		return 0;
9213-	}
9214-	if (stbi__get16be(s) != 3) {
9215-		stbi__rewind(s);
9216-		return 0;
9217-	}
9218-	*comp = 4;
9219-	return 1;
9220-}
9221-
9222-static int
9223-stbi__psd_is16(stbi__context *s)
9224-{
9225-	int channelCount, depth;
9226-	if (stbi__get32be(s) != 0x38425053) {
9227-		stbi__rewind(s);
9228-		return 0;
9229-	}
9230-	if (stbi__get16be(s) != 1) {
9231-		stbi__rewind(s);
9232-		return 0;
9233-	}
9234-	stbi__skip(s, 6);
9235-	channelCount = stbi__get16be(s);
9236-	if (channelCount < 0 || channelCount > 16) {
9237-		stbi__rewind(s);
9238-		return 0;
9239-	}
9240-	STBI_NOTUSED(stbi__get32be(s));
9241-	STBI_NOTUSED(stbi__get32be(s));
9242-	depth = stbi__get16be(s);
9243-	if (depth != 16) {
9244-		stbi__rewind(s);
9245-		return 0;
9246-	}
9247-	return 1;
9248-}
9249-#endif
9250-
9251-#ifndef STBI_NO_PIC
9252-static int
9253-stbi__pic_info(stbi__context *s, int *x, int *y, int *comp)
9254-{
9255-	int act_comp = 0, num_packets = 0, chained, dummy;
9256-	stbi__pic_packet packets[10];
9257-
9258-	if (!x) {
9259-		x = &dummy;
9260-	}
9261-	if (!y) {
9262-		y = &dummy;
9263-	}
9264-	if (!comp) {
9265-		comp = &dummy;
9266-	}
9267-
9268-	if (!stbi__pic_is4(s, "\x53\x80\xF6\x34")) {
9269-		stbi__rewind(s);
9270-		return 0;
9271-	}
9272-
9273-	stbi__skip(s, 88);
9274-
9275-	*x = stbi__get16be(s);
9276-	*y = stbi__get16be(s);
9277-	if (stbi__at_eof(s)) {
9278-		stbi__rewind(s);
9279-		return 0;
9280-	}
9281-	if ((*x) != 0 && (1 << 28) / (*x) < (*y)) {
9282-		stbi__rewind(s);
9283-		return 0;
9284-	}
9285-
9286-	stbi__skip(s, 8);
9287-
9288-	do {
9289-		stbi__pic_packet *packet;
9290-
9291-		if (num_packets == sizeof(packets) / sizeof(packets[0])) {
9292-			return 0;
9293-		}
9294-
9295-		packet = &packets[num_packets++];
9296-		chained = stbi__get8(s);
9297-		packet->size = stbi__get8(s);
9298-		packet->type = stbi__get8(s);
9299-		packet->channel = stbi__get8(s);
9300-		act_comp |= packet->channel;
9301-
9302-		if (stbi__at_eof(s)) {
9303-			stbi__rewind(s);
9304-			return 0;
9305-		}
9306-		if (packet->size != 8) {
9307-			stbi__rewind(s);
9308-			return 0;
9309-		}
9310-	} while (chained);
9311-
9312-	*comp = (act_comp & 0x10 ? 4 : 3);
9313-
9314-	return 1;
9315-}
9316-#endif
9317-
9318-// *************************************************************************************************
9319-// Portable Gray Map and Portable Pixel Map loader
9320-// by Ken Miller
9321-//
9322-// PGM: http://netpbm.sourceforge.net/doc/pgm.html
9323-// PPM: http://netpbm.sourceforge.net/doc/ppm.html
9324-//
9325-// Known limitations:
9326-//    Header comments ('#' to end of line) are skipped, not preserved
9327-//    Does not support ASCII image data (formats P2 and P3)
9328-
9329-#ifndef STBI_NO_PNM
9330-
9331-static int
9332-stbi__pnm_test(stbi__context *s)
9333-{
9334-	char p, t;
9335-	p = (char)stbi__get8(s);
9336-	t = (char)stbi__get8(s);
9337-	if (p != 'P' || (t != '5' && t != '6')) {
9338-		stbi__rewind(s);
9339-		return 0;
9340-	}
9341-	return 1;
9342-}
9343-
9344-static void *
9345-stbi__pnm_load(stbi__context *s, int *x, int *y, int *comp, int req_comp,
9346-               stbi__result_info *ri)
9347-{
9348-	stbi_uc *out;
9349-	STBI_NOTUSED(ri);
9350-
9351-	ri->bits_per_channel =
9352-	    stbi__pnm_info(s, (int *)&s->img_x, (int *)&s->img_y, (int *)&s->img_n);
9353-	if (ri->bits_per_channel == 0) {
9354-		return 0;
9355-	}
9356-
9357-	if (s->img_y > STBI_MAX_DIMENSIONS) {
9358-		return stbi__errpuc("too large", "Very large image (corrupt?)");
9359-	}
9360-	if (s->img_x > STBI_MAX_DIMENSIONS) {
9361-		return stbi__errpuc("too large", "Very large image (corrupt?)");
9362-	}
9363-
9364-	*x = s->img_x;
9365-	*y = s->img_y;
9366-	if (comp) {
9367-		*comp = s->img_n;
9368-	}
9369-
9370-	if (!stbi__mad4sizes_valid(s->img_n, s->img_x, s->img_y,
9371-	                           ri->bits_per_channel / 8, 0)) {
9372-		return stbi__errpuc("too large", "PNM too large");
9373-	}
9374-
9375-	out = (stbi_uc *)stbi__malloc_mad4(s->img_n, s->img_x, s->img_y,
9376-	                                   ri->bits_per_channel / 8, 0);
9377-	if (!out) {
9378-		return stbi__errpuc("outofmem", "Out of memory");
9379-	}
9380-	if (!stbi__getn(s, out,
9381-	                s->img_n * s->img_x * s->img_y *
9382-	                    (ri->bits_per_channel / 8))) {
9383-		STBI_FREE(out);
9384-		return stbi__errpuc("bad PNM", "PNM file truncated");
9385-	}
9386-
9387-	if (req_comp && req_comp != s->img_n) {
9388-		if (ri->bits_per_channel == 16) {
9389-			out = (stbi_uc *)stbi__convert_format16(
9390-			    (stbi__uint16 *)out, s->img_n, req_comp, s->img_x, s->img_y);
9391-		} else {
9392-			out = stbi__convert_format(out, s->img_n, req_comp, s->img_x,
9393-			                           s->img_y);
9394-		}
9395-		if (out == NULL) {
9396-			return out; // stbi__convert_format frees input on failure
9397-		}
9398-	}
9399-	return out;
9400-}
9401-
9402-static int
9403-stbi__pnm_isspace(char c)
9404-{
9405-	return c == ' ' || c == '\t' || c == '\n' || c == '\v' || c == '\f' ||
9406-	       c == '\r';
9407-}
9408-
9409-static void
9410-stbi__pnm_skip_whitespace(stbi__context *s, char *c)
9411-{
9412-	for (;;) {
9413-		while (!stbi__at_eof(s) && stbi__pnm_isspace(*c)) {
9414-			*c = (char)stbi__get8(s);
9415-		}
9416-
9417-		if (stbi__at_eof(s) || *c != '#') {
9418-			break;
9419-		}
9420-
9421-		while (!stbi__at_eof(s) && *c != '\n' && *c != '\r') {
9422-			*c = (char)stbi__get8(s);
9423-		}
9424-	}
9425-}
9426-
9427-static int
9428-stbi__pnm_isdigit(char c)
9429-{
9430-	return c >= '0' && c <= '9';
9431-}
9432-
9433-static int
9434-stbi__pnm_getinteger(stbi__context *s, char *c)
9435-{
9436-	int value = 0;
9437-
9438-	while (!stbi__at_eof(s) && stbi__pnm_isdigit(*c)) {
9439-		value = value * 10 + (*c - '0');
9440-		*c = (char)stbi__get8(s);
9441-		if ((value > 214748364) || (value == 214748364 && *c > '7')) {
9442-			return stbi__err(
9443-			    "integer parse overflow",
9444-			    "Parsing an integer in the PPM header overflowed a 32-bit int");
9445-		}
9446-	}
9447-
9448-	return value;
9449-}
9450-
9451-static int
9452-stbi__pnm_info(stbi__context *s, int *x, int *y, int *comp)
9453-{
9454-	int maxv, dummy;
9455-	char c, p, t;
9456-
9457-	if (!x) {
9458-		x = &dummy;
9459-	}
9460-	if (!y) {
9461-		y = &dummy;
9462-	}
9463-	if (!comp) {
9464-		comp = &dummy;
9465-	}
9466-
9467-	stbi__rewind(s);
9468-
9469-	// Get identifier
9470-	p = (char)stbi__get8(s);
9471-	t = (char)stbi__get8(s);
9472-	if (p != 'P' || (t != '5' && t != '6')) {
9473-		stbi__rewind(s);
9474-		return 0;
9475-	}
9476-
9477-	*comp =
9478-	    (t == '6') ? 3 : 1; // '5' is 1-component .pgm; '6' is 3-component .ppm
9479-
9480-	c = (char)stbi__get8(s);
9481-	stbi__pnm_skip_whitespace(s, &c);
9482-
9483-	*x = stbi__pnm_getinteger(s, &c); // read width
9484-	if (*x == 0) {
9485-		return stbi__err("invalid width",
9486-		                 "PPM image header had zero or overflowing width");
9487-	}
9488-	stbi__pnm_skip_whitespace(s, &c);
9489-
9490-	*y = stbi__pnm_getinteger(s, &c); // read height
9491-	if (*y == 0) {
9492-		return stbi__err("invalid height",
9493-		                 "PPM image header had zero or overflowing height");
9494-	}
9495-	stbi__pnm_skip_whitespace(s, &c);
9496-
9497-	maxv = stbi__pnm_getinteger(s, &c); // read max value
9498-	if (maxv > 65535) {
9499-		return stbi__err("max value > 65535",
9500-		                 "PPM image supports only 8-bit and 16-bit images");
9501-	} else if (maxv > 255) {
9502-		return 16;
9503-	} else {
9504-		return 8;
9505-	}
9506-}
9507-
9508-static int
9509-stbi__pnm_is16(stbi__context *s)
9510-{
9511-	if (stbi__pnm_info(s, NULL, NULL, NULL) == 16) {
9512-		return 1;
9513-	}
9514-	return 0;
9515-}
9516-#endif
9517-
9518-static int
9519-stbi__info_main(stbi__context *s, int *x, int *y, int *comp)
9520-{
9521-#ifndef STBI_NO_JPEG
9522-	if (stbi__jpeg_info(s, x, y, comp)) {
9523-		return 1;
9524-	}
9525-#endif
9526-
9527-#ifndef STBI_NO_PNG
9528-	if (stbi__png_info(s, x, y, comp)) {
9529-		return 1;
9530-	}
9531-#endif
9532-
9533-#ifndef STBI_NO_GIF
9534-	if (stbi__gif_info(s, x, y, comp)) {
9535-		return 1;
9536-	}
9537-#endif
9538-
9539-#ifndef STBI_NO_BMP
9540-	if (stbi__bmp_info(s, x, y, comp)) {
9541-		return 1;
9542-	}
9543-#endif
9544-
9545-#ifndef STBI_NO_PSD
9546-	if (stbi__psd_info(s, x, y, comp)) {
9547-		return 1;
9548-	}
9549-#endif
9550-
9551-#ifndef STBI_NO_PIC
9552-	if (stbi__pic_info(s, x, y, comp)) {
9553-		return 1;
9554-	}
9555-#endif
9556-
9557-#ifndef STBI_NO_PNM
9558-	if (stbi__pnm_info(s, x, y, comp)) {
9559-		return 1;
9560-	}
9561-#endif
9562-
9563-#ifndef STBI_NO_HDR
9564-	if (stbi__hdr_info(s, x, y, comp)) {
9565-		return 1;
9566-	}
9567-#endif
9568-
9569-// test tga last because it's a crappy test!
9570-#ifndef STBI_NO_TGA
9571-	if (stbi__tga_info(s, x, y, comp)) {
9572-		return 1;
9573-	}
9574-#endif
9575-	return stbi__err("unknown image type",
9576-	                 "Image not of any known type, or corrupt");
9577-}
9578-
9579-static int
9580-stbi__is_16_main(stbi__context *s)
9581-{
9582-#ifndef STBI_NO_PNG
9583-	if (stbi__png_is16(s)) {
9584-		return 1;
9585-	}
9586-#endif
9587-
9588-#ifndef STBI_NO_PSD
9589-	if (stbi__psd_is16(s)) {
9590-		return 1;
9591-	}
9592-#endif
9593-
9594-#ifndef STBI_NO_PNM
9595-	if (stbi__pnm_is16(s)) {
9596-		return 1;
9597-	}
9598-#endif
9599-	return 0;
9600-}
9601-
9602-#ifndef STBI_NO_STDIO
9603-STBIDEF int
9604-stbi_info(char const *filename, int *x, int *y, int *comp)
9605-{
9606-	FILE *f = stbi__fopen(filename, "rb");
9607-	int result;
9608-	if (!f) {
9609-		return stbi__err("can't fopen", "Unable to open file");
9610-	}
9611-	result = stbi_info_from_file(f, x, y, comp);
9612-	fclose(f);
9613-	return result;
9614-}
9615-
9616-STBIDEF int
9617-stbi_info_from_file(FILE *f, int *x, int *y, int *comp)
9618-{
9619-	int r;
9620-	stbi__context s;
9621-	long pos = ftell(f);
9622-	stbi__start_file(&s, f);
9623-	r = stbi__info_main(&s, x, y, comp);
9624-	fseek(f, pos, SEEK_SET);
9625-	return r;
9626-}
9627-
9628-STBIDEF int
9629-stbi_is_16_bit(char const *filename)
9630-{
9631-	FILE *f = stbi__fopen(filename, "rb");
9632-	int result;
9633-	if (!f) {
9634-		return stbi__err("can't fopen", "Unable to open file");
9635-	}
9636-	result = stbi_is_16_bit_from_file(f);
9637-	fclose(f);
9638-	return result;
9639-}
9640-
9641-STBIDEF int
9642-stbi_is_16_bit_from_file(FILE *f)
9643-{
9644-	int r;
9645-	stbi__context s;
9646-	long pos = ftell(f);
9647-	stbi__start_file(&s, f);
9648-	r = stbi__is_16_main(&s);
9649-	fseek(f, pos, SEEK_SET);
9650-	return r;
9651-}
9652-#endif // !STBI_NO_STDIO
9653-
9654-STBIDEF int
9655-stbi_info_from_memory(stbi_uc const *buffer, int len, int *x, int *y, int *comp)
9656-{
9657-	stbi__context s;
9658-	stbi__start_mem(&s, buffer, len);
9659-	return stbi__info_main(&s, x, y, comp);
9660-}
9661-
9662-STBIDEF int
9663-stbi_info_from_callbacks(stbi_io_callbacks const *c, void *user, int *x, int *y,
9664-                         int *comp)
9665-{
9666-	stbi__context s;
9667-	stbi__start_callbacks(&s, (stbi_io_callbacks *)c, user);
9668-	return stbi__info_main(&s, x, y, comp);
9669-}
9670-
9671-STBIDEF int
9672-stbi_is_16_bit_from_memory(stbi_uc const *buffer, int len)
9673-{
9674-	stbi__context s;
9675-	stbi__start_mem(&s, buffer, len);
9676-	return stbi__is_16_main(&s);
9677-}
9678-
9679-STBIDEF int
9680-stbi_is_16_bit_from_callbacks(stbi_io_callbacks const *c, void *user)
9681-{
9682-	stbi__context s;
9683-	stbi__start_callbacks(&s, (stbi_io_callbacks *)c, user);
9684-	return stbi__is_16_main(&s);
9685-}
9686-
9687-#endif // STB_IMAGE_IMPLEMENTATION
9688-
9689-/*
9690-   revision history:
9691-      2.20  (2019-02-07) support utf8 filenames in Windows; fix warnings and
9692-   platform ifdefs 2.19  (2018-02-11) fix warning 2.18  (2018-01-30) fix
9693-   warnings 2.17  (2018-01-29) change stbi__shiftsigned to avoid clang -O2 bug
9694-                         1-bit BMP
9695-                         *_is_16_bit api
9696-                         avoid warnings
9697-      2.16  (2017-07-23) all functions have 16-bit variants;
9698-                         STBI_NO_STDIO works again;
9699-                         compilation fixes;
9700-                         fix rounding in unpremultiply;
9701-                         optimize vertical flip;
9702-                         disable raw_len validation;
9703-                         documentation fixes
9704-      2.15  (2017-03-18) fix png-1,2,4 bug; now all Imagenet JPGs decode;
9705-                         warning fixes; disable run-time SSE detection on gcc;
9706-                         uniform handling of optional "return" values;
9707-                         thread-safe initialization of zlib tables
9708-      2.14  (2017-03-03) remove deprecated STBI_JPEG_OLD; fixes for Imagenet
9709-   JPGs 2.13  (2016-11-29) add 16-bit API, only supported for PNG right now 2.12
9710-   (2016-04-02) fix typo in 2.11 PSD fix that caused crashes 2.11  (2016-04-02)
9711-   allocate large structures on the stack remove white matting for transparent
9712-   PSD fix reported channel count for PNG & BMP re-enable SSE2 in non-gcc 64-bit
9713-                         support RGB-formatted JPEG
9714-                         read 16-bit PNGs (only as 8-bit)
9715-      2.10  (2016-01-22) avoid warning introduced in 2.09 by STBI_REALLOC_SIZED
9716-      2.09  (2016-01-16) allow comments in PNM files
9717-                         16-bit-per-pixel TGA (not bit-per-component)
9718-                         info() for TGA could break due to .hdr handling
9719-                         info() for BMP shares code instead of sloppy parse
9720-                         can use STBI_REALLOC_SIZED if allocator doesn't support
9721-   realloc code cleanup 2.08  (2015-09-13) fix to 2.07 cleanup, reading RGB PSD
9722-   as RGBA 2.07  (2015-09-13) fix compiler warnings partial animated GIF support
9723-                         limited 16-bpc PSD support
9724-                         #ifdef unused functions
9725-                         bug with < 92 byte PIC,PNM,HDR,TGA
9726-      2.06  (2015-04-19) fix bug where PSD returns wrong '*comp' value
9727-      2.05  (2015-04-19) fix bug in progressive JPEG handling, fix warning
9728-      2.04  (2015-04-15) try to re-enable SIMD on MinGW 64-bit
9729-      2.03  (2015-04-12) extra corruption checking (mmozeiko)
9730-                         stbi_set_flip_vertically_on_load (nguillemot)
9731-                         fix NEON support; fix mingw support
9732-      2.02  (2015-01-19) fix incorrect assert, fix warning
9733-      2.01  (2015-01-17) fix various warnings; suppress SIMD on gcc 32-bit
9734-   without -msse2 2.00b (2014-12-25) fix STBI_MALLOC in progressive JPEG 2.00
9735-   (2014-12-25) optimize JPG, including x86 SSE2 & NEON SIMD (ryg) progressive
9736-   JPEG (stb) PGM/PPM support (Ken Miller) STBI_MALLOC,STBI_REALLOC,STBI_FREE
9737-                         GIF bugfix -- seemingly never worked
9738-                         STBI_NO_*, STBI_ONLY_*
9739-      1.48  (2014-12-14) fix incorrectly-named assert()
9740-      1.47  (2014-12-14) 1/2/4-bit PNG support, both direct and paletted (Omar
9741-   Cornut & stb) optimize PNG (ryg) fix bug in interlaced PNG with
9742-   user-specified channel count (stb) 1.46  (2014-08-26) fix broken tRNS chunk
9743-   (colorkey-style transparency) in non-paletted PNG 1.45  (2014-08-16) fix
9744-   MSVC-ARM internal compiler error by wrapping malloc 1.44  (2014-08-07)
9745-              various warning fixes from Ronny Chevalier
9746-      1.43  (2014-07-15)
9747-              fix MSVC-only compiler problem in code changed in 1.42
9748-      1.42  (2014-07-09)
9749-              don't define _CRT_SECURE_NO_WARNINGS (affects user code)
9750-              fixes to stbi__cleanup_jpeg path
9751-              added STBI_ASSERT to avoid requiring assert.h
9752-      1.41  (2014-06-25)
9753-              fix search&replace from 1.36 that messed up comments/error
9754-   messages 1.40  (2014-06-22) fix gcc struct-initialization warning 1.39
9755-   (2014-06-15) fix to TGA optimization when req_comp != number of components in
9756-   TGA; fix to GIF loading because BMP wasn't rewinding (whoops, no GIFs in my
9757-   test suite) add support for BMP version 5 (more ignored fields) 1.38
9758-   (2014-06-06) suppress MSVC warnings on integer casts truncating values fix
9759-   accidental rename of 'skip' field of I/O 1.37  (2014-06-04) remove duplicate
9760-   typedef 1.36  (2014-06-03) convert to header file single-file library if
9761-   de-iphone isn't set, load iphone images color-swapped instead of returning
9762-   NULL 1.35  (2014-05-27) various warnings fix broken STBI_SIMD path fix bug
9763-   where stbi_load_from_file no longer left file pointer in correct place fix
9764-   broken non-easy path for 32-bit BMP (possibly never used) TGA optimization by
9765-   Arseny Kapoulkine 1.34  (unknown) use STBI_NOTUSED in
9766-   stbi__resample_row_generic(), fix one more leak in tga failure case 1.33
9767-   (2011-07-14) make stbi_is_hdr work in STBI_NO_HDR (as specified), minor
9768-   compiler-friendly improvements 1.32  (2011-07-13) support for "info" function
9769-   for all supported filetypes (SpartanJ) 1.31  (2011-06-20) a few more leak
9770-   fixes, bug in PNG handling (SpartanJ) 1.30  (2011-06-11) added ability to
9771-   load files via callbacks to accommodate custom input streams (Ben Wenger)
9772-              removed deprecated format-specific test/load functions
9773-              removed support for installable file formats (stbi_loader) --
9774-   would have been broken for IO callbacks anyway error cases in bmp and tga
9775-   give messages and don't leak (Raymond Barbiero, grisha) fix inefficiency in
9776-   decoding 32-bit BMP (David Woo) 1.29  (2010-08-16) various warning fixes from
9777-   Aurelien Pocheville 1.28  (2010-08-01) fix bug in GIF palette transparency
9778-   (SpartanJ) 1.27  (2010-08-01) cast-to-stbi_uc to fix warnings 1.26
9779-   (2010-07-24) fix bug in file buffering for PNG reported by SpartanJ 1.25
9780-   (2010-07-17) refix trans_data warning (Won Chun) 1.24  (2010-07-12) perf
9781-   improvements reading from files on platforms with lock-heavy fgetc() minor
9782-   perf improvements for jpeg deprecated type-specific functions so we'll get
9783-   feedback if they're needed attempt to fix trans_data warning (Won Chun) 1.23
9784-   fixed bug in iPhone support 1.22  (2010-07-10) removed image *writing*
9785-   support stbi_info support from Jetro Lauha GIF support from Jean-Marc Lienher
9786-              iPhone PNG-extensions from James Brown
9787-              warning-fixes from Nicolas Schulz and Janez Zemva (i.e.
9788-   Janez Žemva) 1.21    fix use of 'stbi_uc' in header (reported by jon
9789-   blow) 1.20    added support for Softimage PIC, by Tom Seddon 1.19    bug in
9790-   interlaced PNG corruption check (found by ryg) 1.18  (2008-08-02) fix a
9791-   threading bug (local mutable static) 1.17    support interlaced PNG 1.16
9792-   major bugfix - stbi__convert_format converted one too many pixels 1.15
9793-   initialize some fields for thread safety 1.14    fix threadsafe conversion
9794-   bug header-file-only version (#define STBI_HEADER_FILE_ONLY before including)
9795-      1.13    threadsafe
9796-      1.12    const qualifiers in the API
9797-      1.11    Support installable IDCT, colorspace conversion routines
9798-      1.10    Fixes for 64-bit (don't use "unsigned long")
9799-              optimized upsampling by Fabian "ryg" Giesen
9800-      1.09    Fix format-conversion for PSD code (bad global variables!)
9801-      1.08    Thatcher Ulrich's PSD code integrated by Nicolas Schulz
9802-      1.07    attempt to fix C++ warning/errors again
9803-      1.06    attempt to fix C++ warning/errors again
9804-      1.05    fix TGA loading to return correct *comp and use good luminance
9805-   calc 1.04    default float alpha is 1, not 255; use 'void *' for
9806-   stbi_image_free 1.03    bugfixes to STBI_NO_STDIO, STBI_NO_HDR 1.02 support
9807-   for (subset of) HDR files, float interface for preferred access to them 1.01
9808-   fix bug: possible bug in handling right-side up bmps... not sure fix bug: the
9809-   stbi__bmp_load() and stbi__tga_load() functions didn't work at all 1.00
9810-   interface to zlib that skips zlib header 0.99    correct handling of alpha in
9811-   palette 0.98    TGA loader by lonesock; dynamically add loaders (untested)
9812-      0.97    jpeg errors on too large a file; also catch another malloc failure
9813-      0.96    fix detection of invalid v value - particleman@mollyrocket forum
9814-      0.95    during header scan, seek to markers in case of padding
9815-      0.94    STBI_NO_STDIO to disable stdio usage; rename all #defines the same
9816-      0.93    handle jpegtran output; verbose errors
9817-      0.92    read 4,8,16,24,32-bit BMP files of several formats
9818-      0.91    output 24-bit Windows 3.0 BMP files
9819-      0.90    fix a few more warnings; bump version number to approach 1.0
9820-      0.61    bugfixes due to Marc LeBlanc, Christopher Lloyd
9821-      0.60    fix compiling as c++
9822-      0.59    fix warnings: merge Dave Moore's -Wall fixes
9823-      0.58    fix bug: zlib uncompressed mode len/nlen was wrong endian
9824-      0.57    fix bug: jpg last huffman symbol before marker was >9 bits but
9825-   less than 16 available 0.56    fix bug: zlib uncompressed mode len vs. nlen
9826-      0.55    fix bug: restart_interval not initialized to 0
9827-      0.54    allow NULL for 'int *comp'
9828-      0.53    fix bug in png 3->4; speedup png decoding
9829-      0.52    png handles req_comp=3,4 directly; minor cleanup; jpeg comments
9830-      0.51    obey req_comp requests, 1-component jpegs return as 1-component,
9831-              on 'test' only check type, not whether we support this variant
9832-      0.50  (2006-11-19)
9833-              first released version
9834-*/
9835-
9836-/*
9837-------------------------------------------------------------------------------
9838-This software is available under 2 licenses -- choose whichever you prefer.
9839-------------------------------------------------------------------------------
9840-ALTERNATIVE A - MIT License
9841-Copyright (c) 2017 Sean Barrett
9842-Permission is hereby granted, free of charge, to any person obtaining a copy of
9843-this software and associated documentation files (the "Software"), to deal in
9844-the Software without restriction, including without limitation the rights to
9845-use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
9846-of the Software, and to permit persons to whom the Software is furnished to do
9847-so, subject to the following conditions:
9848-The above copyright notice and this permission notice shall be included in all
9849-copies or substantial portions of the Software.
9850-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
9851-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
9852-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
9853-AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
9854-LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
9855-OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
9856-SOFTWARE.
9857-------------------------------------------------------------------------------
9858-ALTERNATIVE B - Public Domain (www.unlicense.org)
9859-This is free and unencumbered software released into the public domain.
9860-Anyone is free to copy, modify, publish, use, compile, sell, or distribute this
9861-software, either in source code form or as a compiled binary, for any purpose,
9862-commercial or non-commercial, and by any means.
9863-In jurisdictions that recognize copyright laws, the author or authors of this
9864-software dedicate any and all copyright interest in the software to the public
9865-domain. We make this dedication for the benefit of the public at large and to
9866-the detriment of our heirs and successors. We intend this dedication to be an
9867-overt act of relinquishment in perpetuity of all present and future rights to
9868-this software under copyright law.
9869-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
9870-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
9871-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
9872-AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
9873-ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
9874-WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
9875-------------------------------------------------------------------------------
9876-*/
+0, -13259
    1@@ -1,13259 +0,0 @@
    2-/* stb_image_resize2 - v2.17 - public domain image resizing
    3-
    4-   by Jeff Roberts (v2) and Jorge L Rodriguez
    5-   http://github.com/nothings/stb
    6-
    7-   Can be threaded with the extended API. SSE2, AVX, Neon and WASM SIMD support.
    8-   Only scaling and translation are supported, no rotations or shears.
    9-
   10-   COMPILING & LINKING
   11-      In one C/C++ file that #includes this file, do this:
   12-         #define STB_IMAGE_RESIZE_IMPLEMENTATION
   13-      before the #include. That will create the implementation in that file.
   14-
   15-   EASY API CALLS:
   16-     Easy API downsamples w/Mitchell filter, upsamples w/cubic interpolation,
   17-   clamps to edge.
   18-
   19-     stbir_resize_uint8_srgb( input_pixels,  input_w,  input_h,
   20-   input_stride_in_bytes, output_pixels, output_w, output_h,
   21-   output_stride_in_bytes, pixel_layout_enum )
   22-
   23-     stbir_resize_uint8_linear( input_pixels,  input_w,  input_h,
   24-   input_stride_in_bytes, output_pixels, output_w, output_h,
   25-   output_stride_in_bytes, pixel_layout_enum )
   26-
   27-     stbir_resize_float_linear( input_pixels,  input_w,  input_h,
   28-   input_stride_in_bytes, output_pixels, output_w, output_h,
   29-   output_stride_in_bytes, pixel_layout_enum )
   30-
   31-     If you pass NULL or zero for the output_pixels, we will allocate the output
   32-   buffer for you and return it from the function (free with free() or
   33-   STBIR_FREE). As a special case, XX_stride_in_bytes of 0 means packed
   34-   continuously in memory.
   35-
   36-   API LEVELS
   37-      There are three levels of API - easy-to-use, medium-complexity and
   38-   extended-complexity.
   39-
   40-      See the "header file" section of the source for API documentation.
   41-
   42-   ADDITIONAL DOCUMENTATION
   43-
   44-      MEMORY ALLOCATION
   45-         By default, we use malloc and free for memory allocation.  To override
   46-   the memory allocation, before the implementation #include, add a:
   47-
   48-            #define STBIR_MALLOC(size,user_data) ...
   49-            #define STBIR_FREE(ptr,user_data)   ...
   50-
   51-         Each resize makes exactly one call to malloc/free (unless you use the
   52-         extended API where you can do one allocation for many resizes). Under
   53-         address sanitizer, we do separate allocations to find overread/writes.
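A minimal sketch of the allocator override described above. The `counting_malloc` and `counting_free` helpers and the `live_allocations` counter are hypothetical names used only for illustration; the only parts taken from the text are the `STBIR_MALLOC(size,user_data)` and `STBIR_FREE(ptr,user_data)` macro shapes. In a real build these defines would come before the implementation #include of stb_image_resize2.h, which is left out here so the sketch stands alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

// Hypothetical bookkeeping: number of blocks currently allocated.
size_t live_allocations = 0;

// Hypothetical allocator that counts successful allocations.
void *counting_malloc(size_t size, void *user_data)
{
    (void)user_data; // unused in this sketch
    void *ptr = malloc(size);
    if (ptr != NULL)
        ++live_allocations;
    return ptr;
}

// Hypothetical matching free that keeps the counter in sync.
void counting_free(void *ptr, void *user_data)
{
    (void)user_data; // unused in this sketch
    if (ptr != NULL) {
        assert(live_allocations > 0);
        --live_allocations;
    }
    free(ptr);
}

// Macro shapes as given in the documentation above.
#define STBIR_MALLOC(size, user_data) counting_malloc((size), (user_data))
#define STBIR_FREE(ptr, user_data) counting_free((ptr), (user_data))
```

Since the text says each resize makes exactly one allocation, a counter like this should return to zero after every resize call, which makes it a cheap leak check.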
   54-
   55-      PERFORMANCE
   56-         This library was written with an emphasis on performance. When testing
   57-         stb_image_resize with RGBA, the fastest mode is STBIR_4CHANNEL with
   58-         STBIR_TYPE_UINT8 pixels and CLAMPed edges (which is what many other
   59-   resize libs do by default). Also, make sure SIMD is turned on of course
   60-   (default for 64-bit targets). Avoid WRAP edge mode if you want the fastest
   61-   speed.
   62-
   63-         This library also comes with profiling built-in. If you define
   64-   STBIR_PROFILE, you can use the advanced API and get low-level profiling
   65-   information by calling stbir_resize_extended_profile_info() or
   66-   stbir_resize_split_profile_info() after a resize.
   67-
   68-      SIMD
   69-         Most of the routines have optimized SSE2, AVX, NEON and WASM versions.
   70-
   71-         On Microsoft compilers, we automatically turn on SIMD for 64-bit x64
   72-   and ARM; for 32-bit x86 and ARM, you select SIMD mode by defining STBIR_SSE2
   73-   or STBIR_NEON. For AVX and AVX2, we auto-select it by detecting the /arch:AVX
   74-         or /arch:AVX2 switches. You can also always manually turn SSE2, AVX or
   75-   AVX2 support on by defining STBIR_SSE2, STBIR_AVX or STBIR_AVX2.
   76-
   77-         On Linux, SSE2 and Neon are on by default for 64-bit x64 or ARM64. For
   78-   32-bit, we select x86 SIMD mode by whether you have -msse2, -mavx or -mavx2
   79-   enabled on the command line. For 32-bit ARM, you must pass -mfpu=neon-vfpv4
   80-   for both clang and GCC, but GCC also requires an additional
   81-   -mfp16-format=ieee to automatically enable NEON.
   82-
   83-         On x86 platforms, you can also define STBIR_FP16C to turn on FP16C
   84-   instructions for converting back and forth to half-floats. This is
   85-   autoselected when we are using AVX2. Clang and GCC also require the -mf16c
   86-   switch. ARM always uses the built-in half float hardware NEON instructions.
   87-
   88-         You can also tell us to use multiply-add instructions with
   89-   STBIR_USE_FMA. Because x86 doesn't always have fma, we turn it off by default
   90-   to maintain determinism across all platforms. If you don't care about non-FMA
   91-   determinism and are willing to restrict yourself to more recent x86 CPUs
   92-   (around the AVX timeframe), then fma will give you around a 15% speedup.
   93-
   94-         You can force off SIMD in all cases by defining STBIR_NO_SIMD. You can
   95-   turn off AVX or AVX2 specifically with STBIR_NO_AVX or STBIR_NO_AVX2. AVX is
   96-   10% to 40% faster, and AVX2 is generally another 12%.
   97-
   98-      ALPHA CHANNEL
   99-         Most of the resizing functions provide the ability to control how the
  100-   alpha channel of an image is processed.
  101-
  102-         When alpha represents transparency, it is important that when combining
  103-         colors with filtering, the pixels should not be treated equally; they
  104-         should use a weighted average based on their alpha values. For example,
  105-         if a pixel is 1% opaque bright green and another pixel is 99% opaque
  106-         black and you average them, the average will be 50% opaque, but the
  107-         the unweighted average will be a middling green color, while the
  108-   weighted average will be nearly black. This means the unweighted version
  109-   introduced green energy that didn't exist in the source image.
  110-
  111-         (If you want to know why this makes sense, you can work out the math
  112-   for the following: consider what happens if you alpha composite a source
  113-   image over a fixed color and then average the output, vs. if you average the
  114-         source image pixels and then composite that over the same fixed color.
  115-         Only the weighted average produces the same result as the ground truth
  116-         composite-then-average result.)
  117-
  118-         Therefore, it is in general best to "alpha weight" the pixels when
  119-   applying filters to them. This essentially means multiplying the colors by
  120-   the alpha values before combining them, and then dividing by the alpha value
  121-   at the end.
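The green/black example above can be worked through in a few lines. This is only a sketch of the arithmetic for a two-pixel box average, not the resizer's actual code path; the `rgba_f` type and both function names are hypothetical.

```c
#include <assert.h>

// Hypothetical pixel type: non-premultiplied color, all channels in 0..1.
typedef struct {
    float r, g, b, a;
} rgba_f;

// Unweighted: every channel is averaged independently of alpha.
rgba_f average_unweighted(rgba_f p, rgba_f q)
{
    rgba_f out = { (p.r + q.r) * 0.5f, (p.g + q.g) * 0.5f,
                   (p.b + q.b) * 0.5f, (p.a + q.a) * 0.5f };
    return out;
}

// Alpha-weighted: multiply colors by alpha, average, then divide the
// averaged color by the averaged alpha.
rgba_f average_alpha_weighted(rgba_f p, rgba_f q)
{
    assert(p.a >= 0.0f && q.a >= 0.0f);
    float a = (p.a + q.a) * 0.5f;
    rgba_f out = { 0.0f, 0.0f, 0.0f, a };
    if (a > 0.0f) { // with zero total alpha the color is left as black
        out.r = (p.r * p.a + q.r * q.a) * 0.5f / a;
        out.g = (p.g * p.a + q.g * q.a) * 0.5f / a;
        out.b = (p.b * p.a + q.b * q.a) * 0.5f / a;
    }
    return out;
}
```

For a 1% opaque bright green pixel and a 99% opaque black pixel, both functions give an alpha of 0.5, but the unweighted green channel is 0.5 while the weighted one is 0.01, matching the "middling green" versus "nearly black" description.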
  122-
  123-         The computer graphics industry introduced a technique called
  124-   "premultiplied alpha" or "associated alpha" in which image colors are stored
  125-   in image files already multiplied by their alpha. This saves some math when
  126-   compositing, and also avoids the need to divide by the alpha at the end
  127-   (which is quite inefficient). However, while premultiplied alpha is common in
  128-   the movie CGI industry, it is not commonplace in other industries like
  129-   videogames, and most consumer file formats are generally expected to contain
  130-   not-premultiplied colors. For example, Photoshop saves PNG files
  131-   "unpremultiplied", and web browsers like Chrome and Firefox expect PNG images
  132-   to be unpremultiplied.
  133-
  134-         Note that there are three possibilities that might describe your image
  135-         and resize expectation:
  136-
  137-             1. images are not premultiplied, alpha weighting is desired
  138-             2. images are not premultiplied, alpha weighting is not desired
  139-             3. images are premultiplied
  140-
  141-         Both case #2 and case #3 require the exact same math: no alpha
  142-   weighting should be applied or removed. Only case 1 requires extra math
  143-   operations; the other two cases can be handled identically.
  144-
  145-         stb_image_resize expects case #1 by default, applying alpha weighting
  146-   to images, expecting the input images to be unpremultiplied. This is what the
  147-         COLOR+ALPHA buffer types tell the resizer to do.
  148-
  149-         When you use the pixel layouts STBIR_RGBA, STBIR_BGRA, STBIR_ARGB,
  150-         STBIR_ABGR, STBIR_RA, or STBIR_AR you are telling us that the pixels
  151-   are non-premultiplied. In these cases, the resizer will alpha weight the
  152-   colors (effectively creating the premultiplied image), do the filtering, and
  153-   then convert back to non-premult on exit.
  154-
  155-         When you use the pixel layouts STBIR_RGBA_PM, STBIR_BGRA_PM,
  156-   STBIR_ARGB_PM, STBIR_ABGR_PM, STBIR_RA_PM or STBIR_AR_PM, you are telling
  157-   us that the pixels ARE premultiplied. In this case, the resizer doesn't have
  158-   to do the premultiplying - it can filter directly on the input. This is about
  159-   twice as fast as the non-premultiplied case, so it's the right option if your
  160-   data is already set up correctly.
  161-
  162-         When you use the pixel layout STBIR_4CHANNEL or STBIR_2CHANNEL, you are
  163-         telling us that there is no channel that represents transparency; it
  164-   may be RGB and some unrelated fourth channel that has been stored in the
  165-   alpha channel, but it is actually not alpha. No special processing will be
  166-         performed.
  167-
  168-         The difference between the generic 4 or 2 channel layouts, and the
  169-         specialized _PM versions is that with the _PM versions you are telling
  170-   us that the data *is* alpha, just don't premultiply it. That's important when
  171-         using sRGB pixel formats: we need to know where the alpha is, because
  172-         it is converted linearly (rather than with the sRGB converters).
  173-
  174-         Because alpha weighting produces the same effect as premultiplying, you
  175-         even have the option with non-premultiplied inputs to let the resizer
  176-         produce a premultiplied output. Because the initially computed
  177-   alpha-weighted output image is effectively premultiplied, this is actually
  178-   more performant than the normal path which un-premultiplies the output image
  179-   as a final step.
  180-
  181-         Finally, when converting both in and out of non-premultiplied space
  182-   (for example, when using STBIR_RGBA), we go to somewhat heroic measures to
  183-         ensure that areas with zero alpha value pixels get something reasonable
  184-         in the RGB values. If you don't care about the RGB values of zero alpha
  185-         pixels, you can call the stbir_set_non_pm_alpha_speed_over_quality()
  186-         function - this runs a premultiplied resize about 25% faster. That
  187-   said, when you really care about speed, using premultiplied pixels for both
  188-   in and out (STBIR_RGBA_PM, etc) is much faster than both of these premultiplied
  189-         options.
  190-
  191-      PIXEL LAYOUT CONVERSION
  192-         The resizer can convert from some pixel layouts to others. When using
  193-   the stbir_set_pixel_layouts(), you can, for example, specify STBIR_RGBA on
  194-   input, and STBIR_ARGB on output, and it will re-organize the channels during
  195-   the resize. Currently, you can only convert between two pixel layouts with
  196-   the same number of channels.
  197-
  198-      DETERMINISM
  199-         We commit to being deterministic (from x64 to ARM to scalar to SIMD,
  200-   etc). This requires compiling with fast-math off (using at least
  201-   /fp:precise). Also, you must turn off fp-contracting (which turns mult+adds
  202-   into fmas)! We attempt to do this with pragmas, but with Clang, you usually
  203-   want to add -ffp-contract=off to the command line as well.
  204-
  205-         For 32-bit x86, you must use SSE and SSE2 codegen for determinism. That
  206-   is, if the scalar x87 unit gets used at all, we immediately lose determinism.
  207-         On Microsoft Visual Studio 2008 and earlier, from what we can tell
  208-   there is no way to be deterministic in 32-bit x86 (some x87 always leaks in,
  209-   even with /fp:strict). On 32-bit x86 GCC, determinism requires both -msse2 and
  210-         -mfpmath=sse.
  211-
  212-         Note that we will not be deterministic with float data containing NaNs
  213-   - the NaNs will propagate differently on different SIMD and platforms.
  214-
  215-         If you turn on STBIR_USE_FMA, then we will be deterministic with other
  216-         fma targets, but we will differ from non-fma targets (this is
  217-   unavoidable, because an fma isn't simply an add with a mult - it also
  218-   introduces a rounding difference compared to non-fma instruction sequences).
  219-
  220-      FLOAT PIXEL FORMAT RANGE
  221-         Any range of values can be used for the non-alpha float data that you
  222-   pass in (0 to 1, -1 to 1, whatever). However, if you are inputting float
  223-   values but *outputting* bytes or shorts, you must use a range of 0 to 1 so
  224-   that we scale back properly. The alpha channel must also be 0 to 1 for any
  225-   format that does premultiplication prior to resizing.
  226-
  227-         Note also that with float output, using filters with negative lobes,
  228-   the output filtered values might go slightly out of range. You can define
  229-         STBIR_FLOAT_LOW_CLAMP and/or STBIR_FLOAT_HIGH_CLAMP to specify the
  230-   range to clamp to on output, if that's important.
  231-
  232-      MAX/MIN SCALE FACTORS
  233-         The input pixel resolutions are in integers, and we do the internal
  234-   pointer resolution in size_t sized integers. However, the scale ratio from
  235-   input resolution to output resolution is calculated in float form. This means
  236-         the effective possible scale ratio is limited to 24 bits (or 16 million
  237-         to 1). As you get close to the size of the float resolution (again, 16
  238-         million pixels wide or high), you might start seeing float inaccuracy
  239-         issues in general in the pipeline. If you have to do extreme resizes,
  240-         you can usually do this in multiple stages (using float intermediate
  241-         buffers).
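The 24-bit limit mentioned above comes from the significand of a single-precision float: beyond 2^24 = 16,777,216, adjacent integers are no longer distinguishable. A small sketch that shows the cutoff; the function name is hypothetical and the code is independent of the resizer.

```c
#include <assert.h>

// Returns 1 if n + 1 is still distinguishable from n in float precision.
int float_can_step_by_one(float n)
{
    assert(n == n); // reject NaN, which never compares equal to itself
    // volatile forces the sum to be rounded to float before comparing
    volatile float next = n + 1.0f;
    return next != n;
}
```

Called with 16777215.0f this returns 1, and with 16777216.0f it returns 0, which is why scale ratios or coordinates near that size start to lose accuracy.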
  242-
  243-      FLIPPED IMAGES
  244-         Stride is just the delta from one scanline to the next. This means you
  245-   can use a negative stride to handle inverted images (point to the final
  246-         scanline and use a negative stride). You can invert the input or
  247-   output, using negative strides.
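A minimal sketch of the negative-stride idea, independent of the resizer itself. The `copy_rows` helper is hypothetical, and it uses `ptrdiff_t` for the signed stride as an assumption of this sketch. It copies scanlines into a packed buffer, starting wherever `first_row` points and stepping by a signed byte stride.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Copy `height` scanlines of `row_bytes` bytes each into packed `dst`.
// `first_row` is the first scanline to read; `stride_in_bytes` is the
// signed delta from one scanline to the next (negative walks upward).
void copy_rows(const unsigned char *first_row, ptrdiff_t stride_in_bytes,
               int height, size_t row_bytes, unsigned char *dst)
{
    assert(first_row != NULL && dst != NULL && height >= 0);
    for (int y = 0; y < height; ++y) {
        const unsigned char *src = first_row + (ptrdiff_t)y * stride_in_bytes;
        memcpy(dst + (size_t)y * row_bytes, src, row_bytes);
    }
}
```

To read a 3-row image bottom-up, pass a pointer to the last row and a stride of minus the row size, for example `copy_rows(img + 2 * row_bytes, -(ptrdiff_t)row_bytes, 3, row_bytes, out)`.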
  248-
  249-      DEFAULT FILTERS
  250-         For functions which don't provide explicit control over what filters to
  251-         use, you can change the compile-time defaults with:
  252-
  253-            #define STBIR_DEFAULT_FILTER_UPSAMPLE     STBIR_FILTER_something
  254-            #define STBIR_DEFAULT_FILTER_DOWNSAMPLE   STBIR_FILTER_something
  255-
  256-         See stbir_filter in the header-file section for the list of filters.
  257-
  258-      NEW FILTERS
  259-         A number of 1D filter kernels are supplied. For a list of supported
  260-         filters, see the stbir_filter enum. You can install your own filters by
  261-         using the stbir_set_filter_callbacks function.
  262-
  263-      PROGRESS
  264-         For interactive use with slow resize operations, you can use the
  265-         scanline callbacks in the extended API. It would have to be a *very*
  266-   large image resample to need progress though - we're very fast.
  267-
  268-      CEIL and FLOOR
  269-         In scalar mode, the only functions we use from math.h are ceilf and
  270-   floorf, but if you have your own versions, you can define the STBIR_CEILF(v)
  271-   and STBIR_FLOORF(v) macros and we'll use them instead. In SIMD, we just use
  272-         our own versions.
  273-
  274-      ASSERT
  275-         Define STBIR_ASSERT(boolval) to override assert() and not use assert.h
  276-
  277-     PORTING FROM VERSION 1
  278-        The API has changed. You can continue to use the old version of
  279-   stb_image_resize.h, which is available in the "deprecated/" directory.
  280-
  281-        If you're using the old simple-to-use API, porting is straightforward.
  282-        (For more advanced APIs, read the documentation.)
  283-
  284-          stbir_resize_uint8():
  285-            - call `stbir_resize_uint8_linear`, cast channel count to
  286-   `stbir_pixel_layout`
  287-
  288-          stbir_resize_float():
  289-            - call `stbir_resize_float_linear`, cast channel count to
  290-   `stbir_pixel_layout`
  291-
  292-          stbir_resize_uint8_srgb():
  293-            - function name is unchanged
  294-            - cast channel count to `stbir_pixel_layout`
  295-            - above is sufficient unless your image has alpha and it's not
  296-   RGBA/BGRA
  297-              - in that case, follow the below instructions for
  298-   stbir_resize_uint8_srgb_edgemode
  299-
  300-          stbir_resize_uint8_srgb_edgemode()
  301-            - switch to the "medium complexity" API
  302-            - stbir_resize(), very similar API but a few more parameters:
  303-              - pixel_layout: cast channel count to `stbir_pixel_layout`
  304-              - data_type:    STBIR_TYPE_UINT8_SRGB
  305-              - edge:         unchanged (STBIR_EDGE_WRAP, etc.)
  306-              - filter:       STBIR_FILTER_DEFAULT
  307-            - which channel is alpha is specified in stbir_pixel_layout, see
  308-   enum for details
  309-
  310-      FUTURE TODOS
  311-        *  For polyphase integral filters, we just memcpy the coeffs to dupe
  312-           them, but we should indirect and use the same coeff memory.
  313-        *  Add pixel layout conversions for sensible different channel counts
  314-           (maybe, 1->3/4, 3->4, 4->1, 3->1).
  315-         * For SIMD encode and decode scanline routines, do any pre-aligning
  316-           for bad input/output buffer alignments and pitch?
  317-         * For very wide scanlines, should we do vertical strips to stay
  318-   within L2 cache. Maybe do chunks of 1K pixels at a time. There would be some
  319-   pixel reconversion, but probably dwarfed by things falling out of cache.
  320-   Probably also something possible with alternating between scattering and
  321-   gathering at high resize scales?
  322-         * Should we have a multiple MIPs at the same time function (could keep
  323-           more memory in cache during multiple resizes)?
  324-         * Rewrite the coefficient generator to do many at once.
  325-         * AVX-512 vertical kernels - worried about downclocking here.
  326-         * Convert the reincludes to macros when we know they aren't changing.
  327-         * Experiment with pivoting the horizontal and always using the
  328-           vertical filters (which are faster, but perhaps not enough to
  329-   overcome the pivot cost and the extra memory touches). Need to buffer the
  330-   whole image so have to balance memory use.
  331-         * Most of our code is internally function pointers, should we compile
  332-           all the SIMD stuff always and dynamically dispatch?
  333-
  334-   CONTRIBUTORS
  335-      Jeff Roberts: 2.0 implementation, optimizations, SIMD
  336-      Martins Mozeiko: NEON simd, WASM simd, clang and GCC whisperer
  337-      Fabian Giesen: half float and srgb converters
  338-      Sean Barrett: API design, optimizations
  339-      Jorge L Rodriguez: Original 1.0 implementation
  340-      Aras Pranckevicius: bugfixes
  341-      Nathan Reed: warning fixes for 1.0
  342-
  343-   REVISIONS
  344-      2.17 (2025-10-25) silly format bug in easy-to-use APIs.
  345-      2.16 (2025-10-21) fixed the easy-to-use APIs to allow inverted bitmaps
  346-   (negative strides), fix vertical filter kernel callback, fix threaded gather
  347-   buffer priming (and assert). (thanks adipose, TainZerL, and Harrison Green)
  348-      2.15 (2025-07-17) fixed an assert in debug mode when using floats with
  349-   input callbacks, work around GCC warning when adding to null ptr (thanks
  350-   Johannes Spohr and Pyry Kovanen). 2.14 (2025-05-09) fixed a bug using
  351-   downsampling gather horizontal first, and scatter with vertical first. 2.13
  352-   (2025-02-27) fixed a bug when using input callbacks, turned off simd for
  353-                          tiny-c, fixed some variables that should have been
  354-   static, fixes a bug when calculating temp memory with resizes that exceed 2GB
  355-   of temp memory (very large resizes). 2.12 (2024-10-18) fix incorrect use of
  356-   user_data with STBIR_FREE 2.11 (2024-09-08) fix harmless asan warnings in
  357-   2-channel and 3-channel mode with AVX-2, fix some weird scaling edge
  358-   conditions with point sample mode. 2.10 (2024-07-27) fix the defines GCC and
  359-   mingw for loop unroll control, fix MSVC 32-bit arm half float routines. 2.09
  360-   (2024-06-19) fix the defines for 32-bit ARM GCC builds (was selecting
  361-                          hardware half floats).
  362-      2.08 (2024-06-10) fix for RGB->BGR three channel flips and add SIMD
  363-   (thanks to Ryan Salsbury), fix for sub-rect resizes, use the pragmas to
  364-   control unrolling when they are available. 2.07 (2024-05-24) fix for slow
  365-   final split during threaded conversions of very wide scanlines when
  366-   downsampling (caused by extra input converting), fix for wide scanline
  367-   resamples with many splits (int overflow), fix GCC warning. 2.06 (2024-02-10)
  368-   fix for identical width/height 3x or more down-scaling undersampling a single
  369-   row on rare resize ratios (about 1%). 2.05 (2024-02-07) fix for 2 pixel to 1
  370-   pixel resizes with wrap (thanks Aras), fix for output callback (thanks Julien
  371-   Koenen). 2.04 (2023-11-17) fix for rare AVX bug, shadowed symbol (thanks
  372-   Nikola Smiljanic). 2.03 (2023-11-01) ASAN and TSAN warnings fixed, minor
  373-   tweaks. 2.00 (2023-10-10) mostly new source: new api, optimizations, simd,
  374-   vertical-first, etc 2x-5x faster without simd, 4x-12x faster with simd, in
  375-   some cases, 20x to 40x faster esp resizing large to very small. 0.96
  376-   (2019-03-04) fixed warnings 0.95 (2017-07-23) fixed warnings 0.94
  377-   (2017-03-18) fixed warnings 0.93 (2017-03-03) fixed bug with certain
  378-   combinations of heights 0.92 (2017-01-02) fix integer overflow on large
  379-   (>2GB) images 0.91 (2016-04-02) fix warnings; fix handling of subpixel
  380-   regions 0.90 (2014-09-17) first released version
  381-
  382-   LICENSE
  383-     See end of file for license information.
  384-*/
  385-
  386-#if !defined(STB_IMAGE_RESIZE_DO_HORIZONTALS) &&                               \
  387-    !defined(STB_IMAGE_RESIZE_DO_VERTICALS) &&                                 \
  388-    !defined(STB_IMAGE_RESIZE_DO_CODERS) // for internal re-includes
  389-
  390-#ifndef STBIR_INCLUDE_STB_IMAGE_RESIZE2_H
  391-#define STBIR_INCLUDE_STB_IMAGE_RESIZE2_H
  392-
  393-#include <stddef.h>
  394-#ifdef _MSC_VER
  395-typedef unsigned char stbir_uint8;
  396-typedef unsigned short stbir_uint16;
  397-typedef unsigned int stbir_uint32;
  398-typedef unsigned __int64 stbir_uint64;
  399-#else
  400-#include <stdint.h>
  401-typedef uint8_t stbir_uint8;
  402-typedef uint16_t stbir_uint16;
  403-typedef uint32_t stbir_uint32;
  404-typedef uint64_t stbir_uint64;
  405-#endif
  406-
  407-#ifndef STBIRDEF
  408-#ifdef STB_IMAGE_RESIZE_STATIC
  409-#define STBIRDEF static
  410-#else
  411-#ifdef __cplusplus
  412-#define STBIRDEF extern "C"
  413-#else
  414-#define STBIRDEF extern
  415-#endif
  416-#endif
  417-#endif
  418-
  419-//////////////////////////////////////////////////////////////////////////////
  420-////   start "header file" ///////////////////////////////////////////////////
  421-//
  422-// Easy-to-use API:
  423-//
  424-//     * stride is the offset between successive rows of image data
  425-//        in memory, in bytes. specify 0 for packed continuously in memory
  426-//     * colorspace is linear or sRGB as specified by function name
  427-//     * Uses the default filters
  428-//     * Uses edge mode clamped
  429-//     * returned result is 1 for success or 0 in case of an error.
  430-
  431-// stbir_pixel_layout specifies:
  432-//   number of channels
  433-//   order of channels
  434-//   whether color is premultiplied by alpha
  435-// for back compatibility, you can cast the old channel count to an
  436-// stbir_pixel_layout
  437-typedef enum {
  438-	STBIR_1CHANNEL = 1,
  439-	STBIR_2CHANNEL = 2,
  440-	STBIR_RGB = 3, // 3-chan, with order specified (for channel flipping)
  441-	STBIR_BGR = 0, // 3-chan, with order specified (for channel flipping)
  442-	STBIR_4CHANNEL = 5,
  443-
  444-	STBIR_RGBA = 4, // alpha formats, where alpha is NOT premultiplied into
  445-	                // color channels
  446-	STBIR_BGRA = 6,
  447-	STBIR_ARGB = 7,
  448-	STBIR_ABGR = 8,
  449-	STBIR_RA = 9,
  450-	STBIR_AR = 10,
  451-
  452-	STBIR_RGBA_PM =
  453-	    11, // alpha formats, where alpha is premultiplied into color channels
  454-	STBIR_BGRA_PM = 12,
  455-	STBIR_ARGB_PM = 13,
  456-	STBIR_ABGR_PM = 14,
  457-	STBIR_RA_PM = 15,
  458-	STBIR_AR_PM = 16,
  459-
  460-	STBIR_RGBA_NO_AW =
  461-	    11, // alpha formats, where NO alpha weighting is applied at all!
  462-	STBIR_BGRA_NO_AW =
  463-	    12, //   these are just synonyms for the _PM flags (which also do
  464-	STBIR_ARGB_NO_AW =
  465-	    13, //   no alpha weighting). These names just make it more clear
  466-	STBIR_ABGR_NO_AW = 14, //   for some folks).
  467-	STBIR_RA_NO_AW = 15,
  468-	STBIR_AR_NO_AW = 16,
  469-
  470-} stbir_pixel_layout;
  471-
  472-//===============================================================
  473-//  Simple-complexity API
  474-//
  475-//    If output_pixels is NULL (0), then we will allocate the buffer and return
  476-//    it to you.
  477-//--------------------------------
  478-
  479-STBIRDEF unsigned char *
  480-stbir_resize_uint8_srgb(const unsigned char *input_pixels, int input_w,
  481-                        int input_h, int input_stride_in_bytes,
  482-                        unsigned char *output_pixels, int output_w,
  483-                        int output_h, int output_stride_in_bytes,
  484-                        stbir_pixel_layout pixel_type);
  485-
  486-STBIRDEF unsigned char *
  487-stbir_resize_uint8_linear(const unsigned char *input_pixels, int input_w,
  488-                          int input_h, int input_stride_in_bytes,
  489-                          unsigned char *output_pixels, int output_w,
  490-                          int output_h, int output_stride_in_bytes,
  491-                          stbir_pixel_layout pixel_type);
  492-
  493-STBIRDEF float *
  494-stbir_resize_float_linear(const float *input_pixels, int input_w, int input_h,
  495-                          int input_stride_in_bytes, float *output_pixels,
  496-                          int output_w, int output_h,
  497-                          int output_stride_in_bytes,
  498-                          stbir_pixel_layout pixel_type);
  499-//===============================================================
  500-
  501-//===============================================================
  502-// Medium-complexity API
  503-//
  504-// This extends the easy-to-use API as follows:
  505-//
  506-//     * Can specify the datatype - U8, U8_SRGB, U16, FLOAT, HALF_FLOAT
  507-//     * Edge wrap can be selected explicitly
  508-//     * Filter can be selected explicitly
  509-//--------------------------------
  510-
  511-typedef enum {
  512-	STBIR_EDGE_CLAMP = 0,
  513-	STBIR_EDGE_REFLECT = 1,
  514-	STBIR_EDGE_WRAP = 2, // this edge mode is slower and uses more memory
  515-	STBIR_EDGE_ZERO = 3,
  516-} stbir_edge;
  517-
  518-typedef enum {
  519-	STBIR_FILTER_DEFAULT =
  520-	    0,                // use same filter type that easy-to-use API chooses
  521-	STBIR_FILTER_BOX = 1, // A trapezoid w/1-pixel wide ramps, same result as
  522-	                      // box for integer scale ratios
  523-	STBIR_FILTER_TRIANGLE =
  524-	    2, // On upsampling, produces same results as bilinear texture filtering
  525-	STBIR_FILTER_CUBICBSPLINE =
  526-	    3, // The cubic b-spline (aka Mitchell-Netravali with B=1,C=0),
  527-	       // gaussian-esque
  528-	STBIR_FILTER_CATMULLROM = 4, // An interpolating cubic spline
  529-	STBIR_FILTER_MITCHELL = 5,   // Mitchell-Netravali filter with B=1/3, C=1/3
  530-	STBIR_FILTER_POINT_SAMPLE = 6, // Simple point sampling
  531-	STBIR_FILTER_OTHER = 7,        // User callback specified
  532-} stbir_filter;
  533-
  534-typedef enum {
  535-	STBIR_TYPE_UINT8 = 0,
  536-	STBIR_TYPE_UINT8_SRGB = 1,
  537-	STBIR_TYPE_UINT8_SRGB_ALPHA = 2, // alpha channel, when present, should also
  538-	                                 // be SRGB (this is very unusual)
  539-	STBIR_TYPE_UINT16 = 3,
  540-	STBIR_TYPE_FLOAT = 4,
  541-	STBIR_TYPE_HALF_FLOAT = 5
  542-} stbir_datatype;
  543-
  544-// medium api
  545-STBIRDEF void *
  546-stbir_resize(const void *input_pixels, int input_w, int input_h,
  547-             int input_stride_in_bytes, void *output_pixels, int output_w,
  548-             int output_h, int output_stride_in_bytes,
  549-             stbir_pixel_layout pixel_layout, stbir_datatype data_type,
  550-             stbir_edge edge, stbir_filter filter);
  551-//===============================================================
  552-
  553-//===============================================================
  554-// Extended-complexity API
  555-//
  556-// This API exposes all resize functionality.
  557-//
  558-//     * Separate filter types for each axis
  559-//     * Separate edge modes for each axis
  560-//     * Separate input and output data types
  561-//     * Can specify regions with subpixel correctness
  562-//     * Can specify alpha flags
  563-//     * Can specify a memory callback
  564-//     * Can specify a callback data type for pixel input and output
  565-//     * Can be threaded for a single resize
  566-//     * Can be used to resize many frames without recalculating the sampler
  567-//     info
  568-//
  569-//  Use this API as follows:
  570-//     1) Call the stbir_resize_init function on a local STBIR_RESIZE structure
  571-//     2) Call any of the stbir_set functions
  572-//     3) Optionally call stbir_build_samplers() if you are going to resample
  573-//     multiple times
  574-//        with the same input and output dimensions (like resizing video frames)
  575-//     4) Resample by calling stbir_resize_extended().
  576-//     5) Call stbir_free_samplers() if you called stbir_build_samplers()
  577-//--------------------------------
  578-
  579-// Types:
  580-
  581-// INPUT CALLBACK: this callback is used for input scanlines
  582-typedef void const *
  583-stbir_input_callback(void *optional_output, void const *input_ptr,
  584-                     int num_pixels, int x, int y, void *context);
  585-
  586-// OUTPUT CALLBACK: this callback is used for output scanlines
  587-typedef void
  588-stbir_output_callback(void const *output_ptr, int num_pixels, int y,
  589-                      void *context);
  590-
  591-// callbacks for user installed filters
  592-typedef float
  593-stbir__kernel_callback(float x, float scale,
  594-                       void *user_data); // centered at zero
  595-typedef float
  596-stbir__support_callback(float scale, void *user_data);
  597-
  598-// internal structure with precomputed scaling
  599-typedef struct stbir__info stbir__info;
  600-
  601-typedef struct STBIR_RESIZE // use the stbir_resize_init and stbir_set_*
  602-                            // functions to set these values for future
  603-                            // compatibility
  604-{
  605-	void *user_data;
  606-	void const *input_pixels;
  607-	int input_w, input_h;
  608-	double input_s0, input_t0, input_s1, input_t1;
  609-	stbir_input_callback *input_cb;
  610-	void *output_pixels;
  611-	int output_w, output_h;
  612-	int output_subx, output_suby, output_subw, output_subh;
  613-	stbir_output_callback *output_cb;
  614-	int input_stride_in_bytes;
  615-	int output_stride_in_bytes;
  616-	int splits;
  617-	int fast_alpha;
  618-	int needs_rebuild;
  619-	int called_alloc;
  620-	stbir_pixel_layout input_pixel_layout_public;
  621-	stbir_pixel_layout output_pixel_layout_public;
  622-	stbir_datatype input_data_type;
  623-	stbir_datatype output_data_type;
  624-	stbir_filter horizontal_filter, vertical_filter;
  625-	stbir_edge horizontal_edge, vertical_edge;
  626-	stbir__kernel_callback *horizontal_filter_kernel;
  627-	stbir__support_callback *horizontal_filter_support;
  628-	stbir__kernel_callback *vertical_filter_kernel;
  629-	stbir__support_callback *vertical_filter_support;
  630-	stbir__info *samplers;
  631-} STBIR_RESIZE;
  632-
  633-// extended complexity api
  634-
  635-// First off, you must ALWAYS call stbir_resize_init on your resize structure
  636-// before any of the other calls!
  637-STBIRDEF void
  638-stbir_resize_init(STBIR_RESIZE *resize, const void *input_pixels, int input_w,
  639-                  int input_h, int input_stride_in_bytes, // stride can be zero
  640-                  void *output_pixels, int output_w, int output_h,
  641-                  int output_stride_in_bytes, // stride can be zero
  642-                  stbir_pixel_layout pixel_layout, stbir_datatype data_type);
  643-
  644-//===============================================================
  645-// You can update these parameters any time after resize_init and there is no
  646-// cost
  647-//--------------------------------
  648-
  649-STBIRDEF void
  650-stbir_set_datatypes(STBIR_RESIZE *resize, stbir_datatype input_type,
  651-                    stbir_datatype output_type);
  652-STBIRDEF void
  653-stbir_set_pixel_callbacks(
  654-    STBIR_RESIZE *resize, stbir_input_callback *input_cb,
  655-    stbir_output_callback *output_cb); // no callbacks by default
  656-STBIRDEF void
  657-stbir_set_user_data(STBIR_RESIZE *resize,
  658-                    void *user_data); // pass back STBIR_RESIZE* by default
  659-STBIRDEF void
  660-stbir_set_buffer_ptrs(STBIR_RESIZE *resize, const void *input_pixels,
  661-                      int input_stride_in_bytes, void *output_pixels,
  662-                      int output_stride_in_bytes);
  663-
  664-//===============================================================
  665-
  666-//===============================================================
  667-// If you call any of these functions, you will trigger a sampler rebuild!
  668-//--------------------------------
  669-
  670-STBIRDEF int
  671-stbir_set_pixel_layouts(
  672-    STBIR_RESIZE *resize, stbir_pixel_layout input_pixel_layout,
  673-    stbir_pixel_layout output_pixel_layout); // sets new buffer layouts
  674-STBIRDEF int
  675-stbir_set_edgemodes(STBIR_RESIZE *resize, stbir_edge horizontal_edge,
  676-                    stbir_edge vertical_edge); // CLAMP by default
  677-
  678-STBIRDEF int
  679-stbir_set_filters(STBIR_RESIZE *resize, stbir_filter horizontal_filter,
  680-                  stbir_filter vertical_filter); // STBIR_DEFAULT_FILTER_UPSAMPLE/DOWNSAMPLE
  681-                                                 // by default
  682-STBIRDEF int
  683-stbir_set_filter_callbacks(STBIR_RESIZE *resize,
  684-                           stbir__kernel_callback *horizontal_filter,
  685-                           stbir__support_callback *horizontal_support,
  686-                           stbir__kernel_callback *vertical_filter,
  687-                           stbir__support_callback *vertical_support);
  688-
  689-STBIRDEF int
  690-stbir_set_pixel_subrect(
  691-    STBIR_RESIZE *resize, int subx, int suby, int subw,
  692-    int subh); // sets both sub-regions (full regions by default)
  693-STBIRDEF int
  694-stbir_set_input_subrect(
  695-    STBIR_RESIZE *resize, double s0, double t0, double s1,
  696-    double t1); // sets input sub-region (full region by default)
  697-STBIRDEF int
  698-stbir_set_output_pixel_subrect(
  699-    STBIR_RESIZE *resize, int subx, int suby, int subw,
  700-    int subh); // sets output sub-region (full region by default)
  701-
  702-// when inputting AND outputting non-premultiplied alpha pixels, we use a slower
  703-// but higher quality technique
  704-//   that fills the zero alpha pixel's RGB values with something plausible.  If
  705-//   you don't care about areas of zero alpha, you can call this function to get
  706-//   about a 25% speed improvement for STBIR_RGBA to STBIR_RGBA types of
  707-//   resizes.
  708-STBIRDEF int
  709-stbir_set_non_pm_alpha_speed_over_quality(STBIR_RESIZE *resize,
  710-                                          int non_pma_alpha_speed_over_quality);
  711-//===============================================================
  712-
  713-//===============================================================
  714-// You can call build_samplers to prebuild all the internal data we need to
  715-// resample.
  716-//   Then, if you call resize_extended many times with the same resize, you only
  717-//   pay the cost once.
  718-// If you do call build_samplers, you MUST call free_samplers eventually.
  719-//--------------------------------
  720-
  721-// This builds the samplers and does one allocation
  722-STBIRDEF int
  723-stbir_build_samplers(STBIR_RESIZE *resize);
  724-
  725-// You MUST call this, if you call stbir_build_samplers or
  726-// stbir_build_samplers_with_splits
  727-STBIRDEF void
  728-stbir_free_samplers(STBIR_RESIZE *resize);
  729-//===============================================================
  730-
  731-// And this is the main function to perform the resize synchronously on one
  732-// thread.
  733-STBIRDEF int
  734-stbir_resize_extended(STBIR_RESIZE *resize);
  735-
  736-//===============================================================
  737-// Use these functions for multithreading.
  738-//   1) You call stbir_build_samplers_with_splits first on the main thread
  739-//   2) Then stbir_resize_extended_split on each thread
  740-//   3) stbir_free_samplers when done on the main thread
  741-//--------------------------------
  742-
  743-// This will build samplers for threading.
  744-//   You can pass in the number of threads you'd like to use (try_splits).
  745-//   It returns the number of splits (threads) that you can call it with.
  746-//   It might be less if the image resize can't be split up that many ways.
  747-
  748-STBIRDEF int
  749-stbir_build_samplers_with_splits(STBIR_RESIZE *resize, int try_splits);
  750-
  751-// This function does a split of the resizing (you call this function for each
  752-// split, on multiple threads). A split is a piece of the output resize pixel
  753-// space.
  754-
  755-// Note that you MUST call stbir_build_samplers_with_splits before
  756-// stbir_resize_extended_split!
  757-
  758-// Usually, you will call stbir_resize_extended_split with split_start as the
  759-// thread_index
  760-//   and "1" for the split_count.
  761-// But, if you have a weird situation where you MIGHT want 8 threads, but
  762-// sometimes
  763-//   only 4 threads, you can use 0,2,4,6 for the split_start's and use "2" for
  764-//   the split_count each time to turn it into a 4 thread resize. (This is
  765-//   unusual).
  766-
  767-STBIRDEF int
  768-stbir_resize_extended_split(STBIR_RESIZE *resize, int split_start,
  769-                            int split_count);
  770-//===============================================================
  771-
  772-//===============================================================
  773-// Pixel Callbacks info:
  774-//--------------------------------
  775-
  776-//   The input callback is super flexible - it calls you with the input address
  777-//   (based on the stride and base pointer), it gives you an optional_output
  778-//   pointer that you can fill, or you can just return your own pointer into
  779-//   your own data.
  780-//
  781-//   You can also do conversion from non-supported data types if necessary - in
  782-//   this case, you ignore the input_ptr and just use the x and y parameters to
  783-//   calculate your own input_ptr based on the size of each non-supported pixel.
  784-//   (Something like the third example below.)
  785-//
  786-//   You can also install just an input or just an output callback by setting
  787-//   the callback that you don't want to zero.
  788-//
  789-//     First example, progress: (getting a callback that you can monitor the
  790-//     progress):
  791-//        void const * my_callback( void * optional_output, void const *
  792-//        input_ptr, int num_pixels, int x, int y, void * context )
  793-//        {
  794-//           percentage_done = y / input_height;
  795-//           return input_ptr;  // use buffer from call
  796-//        }
  797-//
  798-//     Next example, copying: (copy from some other buffer or stream):
  799-//        void const * my_callback( void * optional_output, void const *
  800-//        input_ptr, int num_pixels, int x, int y, void * context )
  801-//        {
  802-//           CopyOrStreamData( optional_output, other_data_src,
  803-//                             num_pixels * pixel_width_in_bytes );
  804-//           return optional_output;  // return the optional buffer we filled
  805-//        }
  806-//
  807-//     Third example, input another buffer without copying: (zero-copy from
  808-//     other buffer):
  809-//        void const * my_callback( void * optional_output, void const *
  810-//        input_ptr, int num_pixels, int x, int y, void * context )
  811-//        {
  812-//           void * pixels = ( (char*) other_image_base ) +
  813-//              ( y * other_image_stride ) + ( x * other_pixel_width_in_bytes );
  814-//           return pixels;       // return pointer to your data without copying
  815-//        }
  816-//
  817-//
  818-//   The output callback is considerably simpler - it just calls you so that you
  819-//   can dump out each scanline. You could even directly copy out to disk if you
  820-//   have a simple format like TGA or BMP. You can also convert to other output
  821-//   types here if you want.
  822-//
  823-//   Simple example:
  824-//        void my_output( void const * output_ptr, int num_pixels, int y,
  825-//                        void * context )
  826-//        {
  827-//           percentage_done = y / output_height;
  828-//           fwrite( output_ptr, pixel_width_in_bytes, num_pixels, output_file
  829-//           );
  830-//        }
  831-//===============================================================
  832-
  833-//===============================================================
  834-// optional built-in profiling API
  835-//--------------------------------
  836-
  837-#ifdef STBIR_PROFILE
  838-
  839-typedef struct STBIR_PROFILE_INFO {
  840-	stbir_uint64 total_clocks;
  841-
  842-	// how many clocks spent (of total_clocks) in the various resize routines,
  843-	// along with a string description
  844-	//    there are "count" number of zones
  845-	stbir_uint64 clocks[8];
  846-	char const **descriptions;
  847-
  848-	// count of clocks and descriptions
  849-	stbir_uint32 count;
  850-} STBIR_PROFILE_INFO;
  851-
  852-// use after calling stbir_resize_extended (or stbir_build_samplers or
  853-// stbir_build_samplers_with_splits)
  854-STBIRDEF void
  855-stbir_resize_build_profile_info(STBIR_PROFILE_INFO *out_info,
  856-                                STBIR_RESIZE const *resize);
  857-
  858-// use after calling stbir_resize_extended
  859-STBIRDEF void
  860-stbir_resize_extended_profile_info(STBIR_PROFILE_INFO *out_info,
  861-                                   STBIR_RESIZE const *resize);
  862-
  863-// use after calling stbir_resize_extended_split
  864-STBIRDEF void
  865-stbir_resize_split_profile_info(STBIR_PROFILE_INFO *out_info,
  866-                                STBIR_RESIZE const *resize, int split_start,
  867-                                int split_num);
  868-
  869-//===============================================================
  870-
  871-#endif
  872-
  873-////   end header file   /////////////////////////////////////////////////////
  874-#endif // STBIR_INCLUDE_STB_IMAGE_RESIZE2_H
  875-
  876-#if defined(STB_IMAGE_RESIZE_IMPLEMENTATION) ||                                \
  877-    defined(STB_IMAGE_RESIZE2_IMPLEMENTATION)
  878-
  879-#ifndef STBIR_ASSERT
  880-#include <assert.h>
  881-#define STBIR_ASSERT(x) assert(x)
  882-#endif
  883-
  884-#ifndef STBIR_MALLOC
  885-#include <stdlib.h>
  886-#define STBIR_MALLOC(size, user_data) ((void)(user_data), malloc(size))
  887-#define STBIR_FREE(ptr, user_data) ((void)(user_data), free(ptr))
  888-// (we used the comma operator to evaluate user_data, to avoid "unused
  889-// parameter" warnings)
  890-#endif
  891-
  892-#ifdef _MSC_VER
  893-
  894-#define stbir__inline __forceinline
  895-
  896-#else
  897-
  898-#define stbir__inline __inline__
  899-
  900-// Clang address sanitizer
  901-#if defined(__has_feature)
  902-#if __has_feature(address_sanitizer) || __has_feature(memory_sanitizer)
  903-#ifndef STBIR__SEPARATE_ALLOCATIONS
  904-#define STBIR__SEPARATE_ALLOCATIONS
  905-#endif
  906-#endif
  907-#endif
  908-
  909-#endif
  910-
  911-// GCC and MSVC
  912-#if defined(__SANITIZE_ADDRESS__)
  913-#ifndef STBIR__SEPARATE_ALLOCATIONS
  914-#define STBIR__SEPARATE_ALLOCATIONS
  915-#endif
  916-#endif
  917-
  918-// Always turn off automatic FMA use - use STBIR_USE_FMA if you want.
  919-// Otherwise, this is a determinism disaster.
  920-#ifndef STBIR_DONT_CHANGE_FP_CONTRACT // override in case you don't want this
  921-                                      // behavior
  922-#if defined(_MSC_VER) && !defined(__clang__)
  923-#if _MSC_VER > 1200
  924-#pragma fp_contract(off)
  925-#endif
  926-#elif defined(__GNUC__) && !defined(__clang__)
  927-#pragma GCC optimize("fp-contract=off")
  928-#else
  929-#pragma STDC FP_CONTRACT OFF
  930-#endif
  931-#endif
  932-
  933-#ifdef _MSC_VER
  934-#define STBIR__UNUSED(v) (void)(v)
  935-#else
  936-#define STBIR__UNUSED(v) (void)sizeof(v)
  937-#endif
  938-
  939-#define STBIR__ARRAY_SIZE(a) (sizeof((a)) / sizeof((a)[0]))
  940-
  941-#ifndef STBIR_DEFAULT_FILTER_UPSAMPLE
  942-#define STBIR_DEFAULT_FILTER_UPSAMPLE STBIR_FILTER_CATMULLROM
  943-#endif
  944-
  945-#ifndef STBIR_DEFAULT_FILTER_DOWNSAMPLE
  946-#define STBIR_DEFAULT_FILTER_DOWNSAMPLE STBIR_FILTER_MITCHELL
  947-#endif
  948-
  949-#ifndef STBIR__HEADER_FILENAME
  950-#define STBIR__HEADER_FILENAME "stb_image_resize2.h"
  951-#endif
  952-
  953-// the internal pixel layout enums are in a different order, so we can easily do
  954-// range comparisons of types
  955-//   the public pixel layout is ordered in a way that if you cast num_channels
  956-//   (1-4) to the enum, you get something sensible
  957-typedef enum {
  958-	STBIRI_1CHANNEL = 0,
  959-	STBIRI_2CHANNEL = 1,
  960-	STBIRI_RGB = 2,
  961-	STBIRI_BGR = 3,
  962-	STBIRI_4CHANNEL = 4,
  963-
  964-	STBIRI_RGBA = 5,
  965-	STBIRI_BGRA = 6,
  966-	STBIRI_ARGB = 7,
  967-	STBIRI_ABGR = 8,
  968-	STBIRI_RA = 9,
  969-	STBIRI_AR = 10,
  970-
  971-	STBIRI_RGBA_PM = 11,
  972-	STBIRI_BGRA_PM = 12,
  973-	STBIRI_ARGB_PM = 13,
  974-	STBIRI_ABGR_PM = 14,
  975-	STBIRI_RA_PM = 15,
  976-	STBIRI_AR_PM = 16,
  977-} stbir_internal_pixel_layout;
  978-
  979-// define the public pixel layouts to not compile inside the implementation (to
  980-// avoid accidental use)
  981-#define STBIR_BGR bad_dont_use_in_implementation
  982-#define STBIR_1CHANNEL STBIR_BGR
  983-#define STBIR_2CHANNEL STBIR_BGR
  984-#define STBIR_RGB STBIR_BGR
  985-#define STBIR_RGBA STBIR_BGR
  986-#define STBIR_4CHANNEL STBIR_BGR
  987-#define STBIR_BGRA STBIR_BGR
  988-#define STBIR_ARGB STBIR_BGR
  989-#define STBIR_ABGR STBIR_BGR
  990-#define STBIR_RA STBIR_BGR
  991-#define STBIR_AR STBIR_BGR
  992-#define STBIR_RGBA_PM STBIR_BGR
  993-#define STBIR_BGRA_PM STBIR_BGR
  994-#define STBIR_ARGB_PM STBIR_BGR
  995-#define STBIR_ABGR_PM STBIR_BGR
  996-#define STBIR_RA_PM STBIR_BGR
  997-#define STBIR_AR_PM STBIR_BGR
  998-
  999-// must match stbir_datatype
 1000-static unsigned char stbir__type_size[] = {
 1001-    1, 1, 1, 2,
 1002-    4, 2 // STBIR_TYPE_UINT8,STBIR_TYPE_UINT8_SRGB,STBIR_TYPE_UINT8_SRGB_ALPHA,STBIR_TYPE_UINT16,STBIR_TYPE_FLOAT,STBIR_TYPE_HALF_FLOAT
 1003-};
 1004-
 1005-// When gathering, the contributors are which source pixels contribute.
 1006-// When scattering, the contributors are which destination pixels are
 1007-// contributed to.
 1008-typedef struct {
 1009-	int n0; // First contributing pixel
 1010-	int n1; // Last contributing pixel
 1011-} stbir__contributors;
 1012-
 1013-typedef struct {
 1014-	int lowest;  // First sample index for whole filter
 1015-	int highest; // Last sample index for whole filter
 1016-	int widest;  // widest single set of samples for an output
 1017-} stbir__filter_extent_info;
 1018-
 1019-typedef struct {
 1020-	int n0;                     // First pixel of decode buffer to write to
 1021-	int n1;                     // Last pixel of decode that will be written to
 1022-	int pixel_offset_for_input; // Pixel offset into input_scanline
 1023-} stbir__span;
 1024-
 1025-typedef struct stbir__scale_info {
 1026-	int input_full_size;
 1027-	int output_sub_size;
 1028-	float scale;
 1029-	float inv_scale;
 1030-	float pixel_shift; // starting shift in output pixel space (in pixels)
 1031-	int scale_is_rational;
 1032-	stbir_uint32 scale_numerator, scale_denominator;
 1033-} stbir__scale_info;
 1034-
 1035-typedef struct {
 1036-	stbir__contributors *contributors;
 1037-	float *coefficients;
 1038-	stbir__contributors *gather_prescatter_contributors;
 1039-	float *gather_prescatter_coefficients;
 1040-	stbir__scale_info scale_info;
 1041-	float support;
 1042-	stbir_filter filter_enum;
 1043-	stbir__kernel_callback *filter_kernel;
 1044-	stbir__support_callback *filter_support;
 1045-	stbir_edge edge;
 1046-	int coefficient_width;
 1047-	int filter_pixel_width;
 1048-	int filter_pixel_margin;
 1049-	int num_contributors;
 1050-	int contributors_size;
 1051-	int coefficients_size;
 1052-	stbir__filter_extent_info extent_info;
 1053-	int is_gather; // 0 = scatter, 1 = gather with scale >= 1, 2 = gather with
 1054-	               // scale < 1
 1055-	int gather_prescatter_num_contributors;
 1056-	int gather_prescatter_coefficient_width;
 1057-	int gather_prescatter_contributors_size;
 1058-	int gather_prescatter_coefficients_size;
 1059-} stbir__sampler;
 1060-
 1061-typedef struct {
 1062-	stbir__contributors conservative;
 1063-	int edge_sizes[2];    // this can be less than filter_pixel_margin, if the
 1064-	                      // filter and scaling falls off
 1065-	stbir__span spans[2]; // can be two spans, if doing input subrect with edge
 1066-	                      // mode WRAP
 1067-} stbir__extents;
 1068-
 1069-typedef struct {
 1070-#ifdef STBIR_PROFILE
 1071-	union {
 1072-		struct {
 1073-			stbir_uint64 total, looping, vertical, horizontal, decode, encode,
 1074-			    alpha, unalpha;
 1075-		} named;
 1076-		stbir_uint64 array[8];
 1077-	} profile;
 1078-	stbir_uint64 *current_zone_excluded_ptr;
 1079-#endif
 1080-	float *decode_buffer;
 1081-
 1082-	int ring_buffer_first_scanline;
 1083-	int ring_buffer_last_scanline;
 1084-	int ring_buffer_begin_index; // first_scanline is at this index in the ring
 1085-	                             // buffer
 1086-	int start_output_y, end_output_y;
 1087-	int start_input_y, end_input_y; // used in scatter only
 1088-
 1089-#ifdef STBIR__SEPARATE_ALLOCATIONS
 1090-	float **ring_buffers; // one pointer for each ring buffer
 1091-#else
 1092-	float *ring_buffer; // one big buffer that we index into
 1093-#endif
 1094-
 1095-	float *vertical_buffer;
 1096-
 1097-	char no_cache_straddle[64];
 1098-} stbir__per_split_info;
 1099-
 1100-typedef float *
 1101-stbir__decode_pixels_func(float *decode, int width_times_channels,
 1102-                          void const *input);
 1103-typedef void
 1104-stbir__alpha_weight_func(float *decode_buffer, int width_times_channels);
 1105-typedef void
 1106-stbir__horizontal_gather_channels_func(
 1107-    float *output_buffer, unsigned int output_sub_size,
 1108-    float const *decode_buffer,
 1109-    stbir__contributors const *horizontal_contributors,
 1110-    float const *horizontal_coefficients, int coefficient_width);
 1111-typedef void
 1112-stbir__alpha_unweight_func(float *encode_buffer, int width_times_channels);
 1113-typedef void
 1114-stbir__encode_pixels_func(void *output, int width_times_channels,
 1115-                          float const *encode);
 1116-
 1117-struct stbir__info {
 1118-#ifdef STBIR_PROFILE
 1119-	union {
 1120-		struct {
 1121-			stbir_uint64 total, build, alloc, horizontal, vertical, cleanup,
 1122-			    pivot;
 1123-		} named;
 1124-		stbir_uint64 array[7];
 1125-	} profile;
 1126-	stbir_uint64 *current_zone_excluded_ptr;
 1127-#endif
 1128-	stbir__sampler horizontal;
 1129-	stbir__sampler vertical;
 1130-
 1131-	void const *input_data;
 1132-	void *output_data;
 1133-
 1134-	int input_stride_bytes;
 1135-	int output_stride_bytes;
 1136-	int ring_buffer_length_bytes; // The length of an individual entry in the
 1137-	                              // ring buffer. The total number of ring
 1138-	                              // buffers is
 1139-	                              // stbir__get_filter_pixel_width(filter)
 1140-	int ring_buffer_num_entries;  // Total number of entries in the ring buffer.
 1141-
 1142-	stbir_datatype input_type;
 1143-	stbir_datatype output_type;
 1144-
 1145-	stbir_input_callback *in_pixels_cb;
 1146-	void *user_data;
 1147-	stbir_output_callback *out_pixels_cb;
 1148-
 1149-	stbir__extents scanline_extents;
 1150-
 1151-	void *alloced_mem;
 1152-	stbir__per_split_info
 1153-	    *split_info; // by default 1, but there will be N of these allocated
 1154-	                 // based on the thread init you did
 1155-
 1156-	stbir__decode_pixels_func *decode_pixels;
 1157-	stbir__alpha_weight_func *alpha_weight;
 1158-	stbir__horizontal_gather_channels_func *horizontal_gather_channels;
 1159-	stbir__alpha_unweight_func *alpha_unweight;
 1160-	stbir__encode_pixels_func *encode_pixels;
 1161-
 1162-	int alloc_ring_buffer_num_entries; // Number of entries in the ring buffer
 1163-	                                   // that will be allocated
 1164-	int splits;                        // count of splits
 1165-
 1166-	stbir_internal_pixel_layout input_pixel_layout_internal;
 1167-	stbir_internal_pixel_layout output_pixel_layout_internal;
 1168-
 1169-	int input_color_and_type;
 1170-	int offset_x, offset_y; // offset within output_data
 1171-	int vertical_first;
 1172-	int channels;
 1173-	int effective_channels; // same as channels, except on RGBA/ARGB (7), or
 1174-	                        // XA/AX (3)
 1175-	size_t alloced_total;
 1176-};
 1177-
 1178-#define stbir__max_uint8_as_float 255.0f
 1179-#define stbir__max_uint16_as_float 65535.0f
 1180-#define stbir__max_uint8_as_float_inverted 3.9215689e-03f  // (1.0f/255.0f)
 1181-#define stbir__max_uint16_as_float_inverted 1.5259022e-05f // (1.0f/65535.0f)
 1182-#define stbir__small_float                                                     \
 1183-	((float)1 / (1 << 20) / (1 << 20) / (1 << 20) / (1 << 20) / (1 << 20) /    \
 1184-	 (1 << 20))
 1185-
 1186-// min/max friendly
 1187-#define STBIR_CLAMP(x, xmin, xmax)                                             \
 1188-	for (;;) {                                                                 \
 1189-		if ((x) < (xmin))                                                      \
 1190-			(x) = (xmin);                                                      \
 1191-		if ((x) > (xmax))                                                      \
 1192-			(x) = (xmax);                                                      \
 1193-		break;                                                                 \
 1194-	}
 1195-
 1196-static stbir__inline int
 1197-stbir__min(int a, int b)
 1198-{
 1199-	return a < b ? a : b;
 1200-}
 1201-
 1202-static stbir__inline int
 1203-stbir__max(int a, int b)
 1204-{
 1205-	return a > b ? a : b;
 1206-}
 1207-
 1208-static float stbir__srgb_uchar_to_linear_float[256] = {
 1209-    0.000000f, 0.000304f, 0.000607f, 0.000911f, 0.001214f, 0.001518f, 0.001821f,
 1210-    0.002125f, 0.002428f, 0.002732f, 0.003035f, 0.003347f, 0.003677f, 0.004025f,
 1211-    0.004391f, 0.004777f, 0.005182f, 0.005605f, 0.006049f, 0.006512f, 0.006995f,
 1212-    0.007499f, 0.008023f, 0.008568f, 0.009134f, 0.009721f, 0.010330f, 0.010960f,
 1213-    0.011612f, 0.012286f, 0.012983f, 0.013702f, 0.014444f, 0.015209f, 0.015996f,
 1214-    0.016807f, 0.017642f, 0.018500f, 0.019382f, 0.020289f, 0.021219f, 0.022174f,
 1215-    0.023153f, 0.024158f, 0.025187f, 0.026241f, 0.027321f, 0.028426f, 0.029557f,
 1216-    0.030713f, 0.031896f, 0.033105f, 0.034340f, 0.035601f, 0.036889f, 0.038204f,
 1217-    0.039546f, 0.040915f, 0.042311f, 0.043735f, 0.045186f, 0.046665f, 0.048172f,
 1218-    0.049707f, 0.051269f, 0.052861f, 0.054480f, 0.056128f, 0.057805f, 0.059511f,
 1219-    0.061246f, 0.063010f, 0.064803f, 0.066626f, 0.068478f, 0.070360f, 0.072272f,
 1220-    0.074214f, 0.076185f, 0.078187f, 0.080220f, 0.082283f, 0.084376f, 0.086500f,
 1221-    0.088656f, 0.090842f, 0.093059f, 0.095307f, 0.097587f, 0.099899f, 0.102242f,
 1222-    0.104616f, 0.107023f, 0.109462f, 0.111932f, 0.114435f, 0.116971f, 0.119538f,
 1223-    0.122139f, 0.124772f, 0.127438f, 0.130136f, 0.132868f, 0.135633f, 0.138432f,
 1224-    0.141263f, 0.144128f, 0.147027f, 0.149960f, 0.152926f, 0.155926f, 0.158961f,
 1225-    0.162029f, 0.165132f, 0.168269f, 0.171441f, 0.174647f, 0.177888f, 0.181164f,
 1226-    0.184475f, 0.187821f, 0.191202f, 0.194618f, 0.198069f, 0.201556f, 0.205079f,
 1227-    0.208637f, 0.212231f, 0.215861f, 0.219526f, 0.223228f, 0.226966f, 0.230740f,
 1228-    0.234551f, 0.238398f, 0.242281f, 0.246201f, 0.250158f, 0.254152f, 0.258183f,
 1229-    0.262251f, 0.266356f, 0.270498f, 0.274677f, 0.278894f, 0.283149f, 0.287441f,
 1230-    0.291771f, 0.296138f, 0.300544f, 0.304987f, 0.309469f, 0.313989f, 0.318547f,
 1231-    0.323143f, 0.327778f, 0.332452f, 0.337164f, 0.341914f, 0.346704f, 0.351533f,
 1232-    0.356400f, 0.361307f, 0.366253f, 0.371238f, 0.376262f, 0.381326f, 0.386430f,
 1233-    0.391573f, 0.396755f, 0.401978f, 0.407240f, 0.412543f, 0.417885f, 0.423268f,
 1234-    0.428691f, 0.434154f, 0.439657f, 0.445201f, 0.450786f, 0.456411f, 0.462077f,
 1235-    0.467784f, 0.473532f, 0.479320f, 0.485150f, 0.491021f, 0.496933f, 0.502887f,
 1236-    0.508881f, 0.514918f, 0.520996f, 0.527115f, 0.533276f, 0.539480f, 0.545725f,
 1237-    0.552011f, 0.558340f, 0.564712f, 0.571125f, 0.577581f, 0.584078f, 0.590619f,
 1238-    0.597202f, 0.603827f, 0.610496f, 0.617207f, 0.623960f, 0.630757f, 0.637597f,
 1239-    0.644480f, 0.651406f, 0.658375f, 0.665387f, 0.672443f, 0.679543f, 0.686685f,
 1240-    0.693872f, 0.701102f, 0.708376f, 0.715694f, 0.723055f, 0.730461f, 0.737911f,
 1241-    0.745404f, 0.752942f, 0.760525f, 0.768151f, 0.775822f, 0.783538f, 0.791298f,
 1242-    0.799103f, 0.806952f, 0.814847f, 0.822786f, 0.830770f, 0.838799f, 0.846873f,
 1243-    0.854993f, 0.863157f, 0.871367f, 0.879622f, 0.887923f, 0.896269f, 0.904661f,
 1244-    0.913099f, 0.921582f, 0.930111f, 0.938686f, 0.947307f, 0.955974f, 0.964686f,
 1245-    0.973445f, 0.982251f, 0.991102f, 1.0f};
 1246-
 1247-typedef union {
 1248-	unsigned int u;
 1249-	float f;
 1250-} stbir__FP32;
 1251-
 1252-// From https://gist.github.com/rygorous/2203834
 1253-
 1254-static const stbir_uint32 fp32_to_srgb8_tab4[104] = {
 1255-    0x0073000d, 0x007a000d, 0x0080000d, 0x0087000d, 0x008d000d, 0x0094000d,
 1256-    0x009a000d, 0x00a1000d, 0x00a7001a, 0x00b4001a, 0x00c1001a, 0x00ce001a,
 1257-    0x00da001a, 0x00e7001a, 0x00f4001a, 0x0101001a, 0x010e0033, 0x01280033,
 1258-    0x01410033, 0x015b0033, 0x01750033, 0x018f0033, 0x01a80033, 0x01c20033,
 1259-    0x01dc0067, 0x020f0067, 0x02430067, 0x02760067, 0x02aa0067, 0x02dd0067,
 1260-    0x03110067, 0x03440067, 0x037800ce, 0x03df00ce, 0x044600ce, 0x04ad00ce,
 1261-    0x051400ce, 0x057b00c5, 0x05dd00bc, 0x063b00b5, 0x06970158, 0x07420142,
 1262-    0x07e30130, 0x087b0120, 0x090b0112, 0x09940106, 0x0a1700fc, 0x0a9500f2,
 1263-    0x0b0f01cb, 0x0bf401ae, 0x0ccb0195, 0x0d950180, 0x0e56016e, 0x0f0d015e,
 1264-    0x0fbc0150, 0x10630143, 0x11070264, 0x1238023e, 0x1357021d, 0x14660201,
 1265-    0x156601e9, 0x165a01d3, 0x174401c0, 0x182401af, 0x18fe0331, 0x1a9602fe,
 1266-    0x1c1502d2, 0x1d7e02ad, 0x1ed4028d, 0x201a0270, 0x21520256, 0x227d0240,
 1267-    0x239f0443, 0x25c003fe, 0x27bf03c4, 0x29a10392, 0x2b6a0367, 0x2d1d0341,
 1268-    0x2ebe031f, 0x304d0300, 0x31d105b0, 0x34a80555, 0x37520507, 0x39d504c5,
 1269-    0x3c37048b, 0x3e7c0458, 0x40a8042a, 0x42bd0401, 0x44c20798, 0x488e071e,
 1270-    0x4c1c06b6, 0x4f76065d, 0x52a50610, 0x55ac05cc, 0x5892058f, 0x5b590559,
 1271-    0x5e0c0a23, 0x631c0980, 0x67db08f6, 0x6c55087f, 0x70940818, 0x74a007bd,
 1272-    0x787d076c, 0x7c330723,
 1273-};
 1274-
 1275-static stbir__inline stbir_uint8
 1276-stbir__linear_to_srgb_uchar(float in)
 1277-{
 1278-	static const stbir__FP32 almostone = {0x3f7fffff}; // 1-eps
 1279-	static const stbir__FP32 minval = {(127 - 13) << 23};
 1280-	stbir_uint32 tab, bias, scale, t;
 1281-	stbir__FP32 f;
 1282-
 1283-	// Clamp to [2^(-13), 1-eps]; these two values map to 0 and 1, respectively.
 1284-	// The tests are carefully written so that NaNs map to 0, same as in the
 1285-	// reference implementation.
 1286-	if (!(in > minval.f)) { // written this way to catch NaNs
 1287-		return 0;
 1288-	}
 1289-	if (in > almostone.f) {
 1290-		return 255;
 1291-	}
 1292-
 1293-	// Do the table lookup and unpack bias, scale
 1294-	f.f = in;
 1295-	tab = fp32_to_srgb8_tab4[(f.u - minval.u) >> 20];
 1296-	bias = (tab >> 16) << 9;
 1297-	scale = tab & 0xffff;
 1298-
 1299-	// Grab next-highest mantissa bits and perform linear interpolation
 1300-	t = (f.u >> 12) & 0xff;
 1301-	return (unsigned char)((bias + scale * t) >> 16);
 1302-}
 1303-
 1304-#ifndef STBIR_FORCE_GATHER_FILTER_SCANLINES_AMOUNT
 1305-#define STBIR_FORCE_GATHER_FILTER_SCANLINES_AMOUNT                             \
 1306-	32 // when downsampling and <= 32 scanlines of buffering, use gather. gather
 1307-	   // used down to 1/8th scaling for 25% win.
 1308-#endif
 1309-
 1310-#ifndef STBIR_FORCE_MINIMUM_SCANLINES_FOR_SPLITS
 1311-#define STBIR_FORCE_MINIMUM_SCANLINES_FOR_SPLITS                               \
 1312-	4 // when threading, what is the minimum number of scanlines for a split?
 1313-#endif
 1314-
 1315-#define STBIR_INPUT_CALLBACK_PADDING 3
 1316-
 1317-#ifdef _M_IX86_FP
 1318-#if (_M_IX86_FP >= 1)
 1319-#ifndef STBIR_SSE
 1320-#define STBIR_SSE
 1321-#endif
 1322-#endif
 1323-#endif
 1324-
 1325-#ifdef __TINYC__
 1326-// tiny c has no intrinsics yet - this can become a version check if they add
 1327-// them
 1328-#define STBIR_NO_SIMD
 1329-#endif
 1330-
 1331-#if defined(_x86_64) || defined(__x86_64__) || defined(_M_X64) ||              \
 1332-    defined(__x86_64) || defined(_M_AMD64) || defined(__SSE2__) ||             \
 1333-    defined(STBIR_SSE) || defined(STBIR_SSE2)
 1334-#ifndef STBIR_SSE2
 1335-#define STBIR_SSE2
 1336-#endif
 1337-#if defined(__AVX__) || defined(STBIR_AVX2)
 1338-#ifndef STBIR_AVX
 1339-#ifndef STBIR_NO_AVX
 1340-#define STBIR_AVX
 1341-#endif
 1342-#endif
 1343-#endif
 1344-#if defined(__AVX2__) || defined(STBIR_AVX2)
 1345-#ifndef STBIR_NO_AVX2
 1346-#ifndef STBIR_AVX2
 1347-#define STBIR_AVX2
 1348-#endif
 1349-#if defined(_MSC_VER) && !defined(__clang__)
 1350-#ifndef STBIR_FP16C // FP16C instructions are on all AVX2 cpus, so we can
 1351-                    // autoselect it here on microsoft - clang needs -mf16c
 1352-#define STBIR_FP16C
 1353-#endif
 1354-#endif
 1355-#endif
 1356-#endif
 1357-#ifdef __F16C__
 1358-#ifndef STBIR_FP16C // turn on FP16C instructions if the define is set (for
 1359-                    // clang and gcc)
 1360-#define STBIR_FP16C
 1361-#endif
 1362-#endif
 1363-#endif
 1364-
 1365-#if defined(_M_ARM64) || defined(__aarch64__) || defined(__arm64__) ||         \
 1366-    ((__ARM_NEON_FP & 4) != 0) || defined(__ARM_NEON__)
 1367-#ifndef STBIR_NEON
 1368-#define STBIR_NEON
 1369-#endif
 1370-#endif
 1371-
 1372-#if defined(_M_ARM) || defined(__arm__)
 1373-#ifdef STBIR_USE_FMA
 1374-#undef STBIR_USE_FMA // no FMA for 32-bit arm on MSVC
 1375-#endif
 1376-#endif
 1377-
 1378-#if defined(__wasm__) && defined(__wasm_simd128__)
 1379-#ifndef STBIR_WASM
 1380-#define STBIR_WASM
 1381-#endif
 1382-#endif
 1383-
 1384-// restrict pointers for the output pointers, other loop and unroll control
 1385-#if defined(_MSC_VER) && !defined(__clang__)
 1386-#define STBIR_STREAMOUT_PTR(star) star __restrict
 1387-#define STBIR_NO_UNROLL(ptr)                                                   \
 1388-	__assume(ptr) // this oddly keeps msvc from unrolling a loop
 1389-#if _MSC_VER >= 1900
 1390-#define STBIR_NO_UNROLL_LOOP_START __pragma(loop(no_vector))
 1391-#else
 1392-#define STBIR_NO_UNROLL_LOOP_START
 1393-#endif
 1394-#elif defined(__clang__)
 1395-#define STBIR_STREAMOUT_PTR(star) star __restrict__
 1396-#define STBIR_NO_UNROLL(ptr) __asm__("" ::"r"(ptr))
 1397-#if (__clang_major__ >= 4) || ((__clang_major__ >= 3) && (__clang_minor__ >= 5))
 1398-#define STBIR_NO_UNROLL_LOOP_START                                             \
 1399-	_Pragma("clang loop unroll(disable)")                                      \
 1400-	    _Pragma("clang loop vectorize(disable)")
 1401-#else
 1402-#define STBIR_NO_UNROLL_LOOP_START
 1403-#endif
 1404-#elif defined(__GNUC__)
 1405-#define STBIR_STREAMOUT_PTR(star) star __restrict__
 1406-#define STBIR_NO_UNROLL(ptr) __asm__("" ::"r"(ptr))
 1407-#if __GNUC__ >= 14
 1408-#define STBIR_NO_UNROLL_LOOP_START                                             \
 1409-	_Pragma("GCC unroll 0") _Pragma("GCC novector")
 1410-#else
 1411-#define STBIR_NO_UNROLL_LOOP_START
 1412-#endif
 1413-#define STBIR_NO_UNROLL_LOOP_START_INF_FOR
 1414-#else
 1415-#define STBIR_STREAMOUT_PTR(star) star
 1416-#define STBIR_NO_UNROLL(ptr)
 1417-#define STBIR_NO_UNROLL_LOOP_START
 1418-#endif
 1419-
 1420-#ifndef STBIR_NO_UNROLL_LOOP_START_INF_FOR
 1421-#define STBIR_NO_UNROLL_LOOP_START_INF_FOR STBIR_NO_UNROLL_LOOP_START
 1422-#endif
 1423-
 1424-#ifdef STBIR_NO_SIMD // force simd off for whatever reason
 1425-
 1426-// force simd off overrides everything else, so clear it all
 1427-
 1428-#ifdef STBIR_SSE2
 1429-#undef STBIR_SSE2
 1430-#endif
 1431-
 1432-#ifdef STBIR_AVX
 1433-#undef STBIR_AVX
 1434-#endif
 1435-
 1436-#ifdef STBIR_NEON
 1437-#undef STBIR_NEON
 1438-#endif
 1439-
 1440-#ifdef STBIR_AVX2
 1441-#undef STBIR_AVX2
 1442-#endif
 1443-
 1444-#ifdef STBIR_FP16C
 1445-#undef STBIR_FP16C
 1446-#endif
 1447-
 1448-#ifdef STBIR_WASM
 1449-#undef STBIR_WASM
 1450-#endif
 1451-
 1452-#ifdef STBIR_SIMD
 1453-#undef STBIR_SIMD
 1454-#endif
 1455-
 1456-#else // STBIR_SIMD
 1457-
 1458-#ifdef STBIR_SSE2
 1459-#include <emmintrin.h>
 1460-
 1461-#define stbir__simdf __m128
 1462-#define stbir__simdi __m128i
 1463-
 1464-#define stbir_simdi_castf(reg) _mm_castps_si128(reg)
 1465-#define stbir_simdf_casti(reg) _mm_castsi128_ps(reg)
 1466-
 1467-#define stbir__simdf_load(reg, ptr) (reg) = _mm_loadu_ps((float const *)(ptr))
 1468-#define stbir__simdi_load(reg, ptr)                                            \
 1469-	(reg) = _mm_loadu_si128((stbir__simdi const *)(ptr))
 1470-#define stbir__simdf_load1(out, ptr)                                           \
 1471-	(out) = _mm_load_ss((float const *)(ptr)) // top values can be random (not
 1472-	                                          // denormal or nan for perf)
 1473-#define stbir__simdi_load1(out, ptr)                                           \
 1474-	(out) = _mm_castps_si128(_mm_load_ss((float const *)(ptr)))
 1475-#define stbir__simdf_load1z(out, ptr)                                          \
 1476-	(out) = _mm_load_ss((float const *)(ptr)) // top values must be zero
 1477-#define stbir__simdf_frep4(fvar) _mm_set_ps1(fvar)
 1478-#define stbir__simdf_load1frep4(out, fvar) (out) = _mm_set_ps1(fvar)
 1479-#define stbir__simdf_load2(out, ptr)                                           \
 1480-	(out) = _mm_castsi128_ps(                                                  \
 1481-	    _mm_loadl_epi64((__m128i *)(ptr))) // top values can be random (not
 1482-	                                       // denormal or nan for perf)
 1483-#define stbir__simdf_load2z(out, ptr)                                          \
 1484-	(out) = _mm_castsi128_ps(                                                  \
 1485-	    _mm_loadl_epi64((__m128i *)(ptr))) // top values must be zero
 1486-#define stbir__simdf_load2hmerge(out, reg, ptr)                                \
 1487-	(out) = _mm_castpd_ps(_mm_loadh_pd(_mm_castps_pd(reg), (double *)(ptr)))
 1488-
 1489-#define stbir__simdf_zeroP() _mm_setzero_ps()
 1490-#define stbir__simdf_zero(reg) (reg) = _mm_setzero_ps()
 1491-
 1492-#define stbir__simdf_store(ptr, reg) _mm_storeu_ps((float *)(ptr), reg)
 1493-#define stbir__simdf_store1(ptr, reg) _mm_store_ss((float *)(ptr), reg)
 1494-#define stbir__simdf_store2(ptr, reg)                                          \
 1495-	_mm_storel_epi64((__m128i *)(ptr), _mm_castps_si128(reg))
 1496-#define stbir__simdf_store2h(ptr, reg)                                         \
 1497-	_mm_storeh_pd((double *)(ptr), _mm_castps_pd(reg))
 1498-
 1499-#define stbir__simdi_store(ptr, reg) _mm_storeu_si128((__m128i *)(ptr), reg)
 1500-#define stbir__simdi_store1(ptr, reg)                                          \
 1501-	_mm_store_ss((float *)(ptr), _mm_castsi128_ps(reg))
 1502-#define stbir__simdi_store2(ptr, reg) _mm_storel_epi64((__m128i *)(ptr), (reg))
 1503-
 1504-#define stbir__prefetch(ptr) _mm_prefetch((char *)(ptr), _MM_HINT_T0)
 1505-
 1506-#define stbir__simdi_expand_u8_to_u32(out0, out1, out2, out3, ireg)            \
 1507-	{                                                                          \
 1508-		stbir__simdi zero = _mm_setzero_si128();                               \
 1509-		out2 = _mm_unpacklo_epi8(ireg, zero);                                  \
 1510-		out3 = _mm_unpackhi_epi8(ireg, zero);                                  \
 1511-		out0 = _mm_unpacklo_epi16(out2, zero);                                 \
 1512-		out1 = _mm_unpackhi_epi16(out2, zero);                                 \
 1513-		out2 = _mm_unpacklo_epi16(out3, zero);                                 \
 1514-		out3 = _mm_unpackhi_epi16(out3, zero);                                 \
 1515-	}
 1516-
 1517-#define stbir__simdi_expand_u8_to_1u32(out, ireg)                              \
 1518-	{                                                                          \
 1519-		stbir__simdi zero = _mm_setzero_si128();                               \
 1520-		out = _mm_unpacklo_epi8(ireg, zero);                                   \
 1521-		out = _mm_unpacklo_epi16(out, zero);                                   \
 1522-	}
 1523-
 1524-#define stbir__simdi_expand_u16_to_u32(out0, out1, ireg)                       \
 1525-	{                                                                          \
 1526-		stbir__simdi zero = _mm_setzero_si128();                               \
 1527-		out0 = _mm_unpacklo_epi16(ireg, zero);                                 \
 1528-		out1 = _mm_unpackhi_epi16(ireg, zero);                                 \
 1529-	}
 1530-
 1531-#define stbir__simdf_convert_float_to_i32(i, f) (i) = _mm_cvttps_epi32(f)
 1532-#define stbir__simdf_convert_float_to_int(f) _mm_cvtt_ss2si(f)
 1533-#define stbir__simdf_convert_float_to_uint8(f)                                 \
 1534-	((unsigned char)_mm_cvtsi128_si32(_mm_cvttps_epi32(                        \
 1535-	    _mm_max_ps(_mm_min_ps(f, STBIR__CONSTF(STBIR_max_uint8_as_float)),     \
 1536-	               _mm_setzero_ps()))))
 1537-#define stbir__simdf_convert_float_to_short(f)                                 \
 1538-	((unsigned short)_mm_cvtsi128_si32(_mm_cvttps_epi32(                       \
 1539-	    _mm_max_ps(_mm_min_ps(f, STBIR__CONSTF(STBIR_max_uint16_as_float)),    \
 1540-	               _mm_setzero_ps()))))
 1541-
 1542-#define stbir__simdi_to_int(i) _mm_cvtsi128_si32(i)
 1543-#define stbir__simdi_convert_i32_to_float(out, ireg)                           \
 1544-	(out) = _mm_cvtepi32_ps(ireg)
 1545-#define stbir__simdf_add(out, reg0, reg1) (out) = _mm_add_ps(reg0, reg1)
 1546-#define stbir__simdf_mult(out, reg0, reg1) (out) = _mm_mul_ps(reg0, reg1)
 1547-#define stbir__simdf_mult_mem(out, reg, ptr)                                   \
 1548-	(out) = _mm_mul_ps(reg, _mm_loadu_ps((float const *)(ptr)))
 1549-#define stbir__simdf_mult1_mem(out, reg, ptr)                                  \
 1550-	(out) = _mm_mul_ss(reg, _mm_load_ss((float const *)(ptr)))
 1551-#define stbir__simdf_add_mem(out, reg, ptr)                                    \
 1552-	(out) = _mm_add_ps(reg, _mm_loadu_ps((float const *)(ptr)))
 1553-#define stbir__simdf_add1_mem(out, reg, ptr)                                   \
 1554-	(out) = _mm_add_ss(reg, _mm_load_ss((float const *)(ptr)))
 1555-
 1556-#ifdef STBIR_USE_FMA // not on by default to maintain bit identical simd to
 1557-                     // non-simd
 1558-#include <immintrin.h>
 1559-#define stbir__simdf_madd(out, add, mul1, mul2)                                \
 1560-	(out) = _mm_fmadd_ps(mul1, mul2, add)
 1561-#define stbir__simdf_madd1(out, add, mul1, mul2)                               \
 1562-	(out) = _mm_fmadd_ss(mul1, mul2, add)
 1563-#define stbir__simdf_madd_mem(out, add, mul, ptr)                              \
 1564-	(out) = _mm_fmadd_ps(mul, _mm_loadu_ps((float const *)(ptr)), add)
 1565-#define stbir__simdf_madd1_mem(out, add, mul, ptr)                             \
 1566-	(out) = _mm_fmadd_ss(mul, _mm_load_ss((float const *)(ptr)), add)
 1567-#else
 1568-#define stbir__simdf_madd(out, add, mul1, mul2)                                \
 1569-	(out) = _mm_add_ps(add, _mm_mul_ps(mul1, mul2))
 1570-#define stbir__simdf_madd1(out, add, mul1, mul2)                               \
 1571-	(out) = _mm_add_ss(add, _mm_mul_ss(mul1, mul2))
 1572-#define stbir__simdf_madd_mem(out, add, mul, ptr)                              \
 1573-	(out) = _mm_add_ps(add, _mm_mul_ps(mul, _mm_loadu_ps((float const *)(ptr))))
 1574-#define stbir__simdf_madd1_mem(out, add, mul, ptr)                             \
 1575-	(out) = _mm_add_ss(add, _mm_mul_ss(mul, _mm_load_ss((float const *)(ptr))))
 1576-#endif
 1577-
 1578-#define stbir__simdf_add1(out, reg0, reg1) (out) = _mm_add_ss(reg0, reg1)
 1579-#define stbir__simdf_mult1(out, reg0, reg1) (out) = _mm_mul_ss(reg0, reg1)
 1580-
 1581-#define stbir__simdf_and(out, reg0, reg1) (out) = _mm_and_ps(reg0, reg1)
 1582-#define stbir__simdf_or(out, reg0, reg1) (out) = _mm_or_ps(reg0, reg1)
 1583-
 1584-#define stbir__simdf_min(out, reg0, reg1) (out) = _mm_min_ps(reg0, reg1)
 1585-#define stbir__simdf_max(out, reg0, reg1) (out) = _mm_max_ps(reg0, reg1)
 1586-#define stbir__simdf_min1(out, reg0, reg1) (out) = _mm_min_ss(reg0, reg1)
 1587-#define stbir__simdf_max1(out, reg0, reg1) (out) = _mm_max_ss(reg0, reg1)
 1588-
 1589-#define stbir__simdf_0123ABCDto3ABx(out, reg0, reg1)                           \
 1590-	(out) = _mm_castsi128_ps(_mm_shuffle_epi32(                                \
 1591-	    _mm_castps_si128(_mm_shuffle_ps(                                       \
 1592-	        reg1, reg0, (0 << 0) + (1 << 2) + (2 << 4) + (3 << 6))),           \
 1593-	    (3 << 0) + (0 << 2) + (1 << 4) + (2 << 6)))
 1594-#define stbir__simdf_0123ABCDto23Ax(out, reg0, reg1)                           \
 1595-	(out) = _mm_castsi128_ps(_mm_shuffle_epi32(                                \
 1596-	    _mm_castps_si128(_mm_shuffle_ps(                                       \
 1597-	        reg1, reg0, (0 << 0) + (1 << 2) + (2 << 4) + (3 << 6))),           \
 1598-	    (2 << 0) + (3 << 2) + (0 << 4) + (1 << 6)))
 1599-
 1600-static const stbir__simdf STBIR_zeroones = {0.0f, 1.0f, 0.0f, 1.0f};
 1601-static const stbir__simdf STBIR_onezeros = {1.0f, 0.0f, 1.0f, 0.0f};
 1602-#define stbir__simdf_aaa1(out, alp, ones)                                      \
 1603-	(out) = _mm_castsi128_ps(                                                  \
 1604-	    _mm_shuffle_epi32(_mm_castps_si128(_mm_movehl_ps(ones, alp)),          \
 1605-	                      (1 << 0) + (1 << 2) + (1 << 4) + (2 << 6)))
 1606-#define stbir__simdf_1aaa(out, alp, ones)                                      \
 1607-	(out) = _mm_castsi128_ps(                                                  \
 1608-	    _mm_shuffle_epi32(_mm_castps_si128(_mm_movelh_ps(ones, alp)),          \
 1609-	                      (0 << 0) + (2 << 2) + (2 << 4) + (2 << 6)))
 1610-#define stbir__simdf_a1a1(out, alp, ones)                                      \
 1611-	(out) =                                                                    \
 1612-	    _mm_or_ps(_mm_castsi128_ps(_mm_srli_epi64(_mm_castps_si128(alp), 32)), \
 1613-	              STBIR_zeroones)
 1614-#define stbir__simdf_1a1a(out, alp, ones)                                      \
 1615-	(out) =                                                                    \
 1616-	    _mm_or_ps(_mm_castsi128_ps(_mm_slli_epi64(_mm_castps_si128(alp), 32)), \
 1617-	              STBIR_onezeros)
 1618-
 1619-#define stbir__simdf_swiz(reg, one, two, three, four)                          \
 1620-	_mm_castsi128_ps(_mm_shuffle_epi32(_mm_castps_si128(reg),                  \
 1621-	                                   (one << 0) + (two << 2) +               \
 1622-	                                       (three << 4) + (four << 6)))
 1623-
 1624-#define stbir__simdi_and(out, reg0, reg1) (out) = _mm_and_si128(reg0, reg1)
 1625-#define stbir__simdi_or(out, reg0, reg1) (out) = _mm_or_si128(reg0, reg1)
 1626-#define stbir__simdi_16madd(out, reg0, reg1) (out) = _mm_madd_epi16(reg0, reg1)
 1627-
 1628-#define stbir__simdf_pack_to_8bytes(out, aa, bb)                               \
 1629-	{                                                                          \
 1630-		stbir__simdf af, bf;                                                   \
 1631-		stbir__simdi a, b;                                                     \
 1632-		af = _mm_min_ps(aa, STBIR_max_uint8_as_float);                         \
 1633-		bf = _mm_min_ps(bb, STBIR_max_uint8_as_float);                         \
 1634-		af = _mm_max_ps(af, _mm_setzero_ps());                                 \
 1635-		bf = _mm_max_ps(bf, _mm_setzero_ps());                                 \
 1636-		a = _mm_cvttps_epi32(af);                                              \
 1637-		b = _mm_cvttps_epi32(bf);                                              \
 1638-		a = _mm_packs_epi32(a, b);                                             \
 1639-		out = _mm_packus_epi16(a, a);                                          \
 1640-	}
 1641-
 1642-#define stbir__simdf_load4_transposed(o0, o1, o2, o3, ptr)                     \
 1643-	stbir__simdf_load(o0, (ptr));                                              \
 1644-	stbir__simdf_load(o1, (ptr) + 4);                                          \
 1645-	stbir__simdf_load(o2, (ptr) + 8);                                          \
 1646-	stbir__simdf_load(o3, (ptr) + 12);                                         \
 1647-	{                                                                          \
 1648-		__m128 tmp0, tmp1, tmp2, tmp3;                                         \
 1649-		tmp0 = _mm_unpacklo_ps(o0, o1);                                        \
 1650-		tmp2 = _mm_unpacklo_ps(o2, o3);                                        \
 1651-		tmp1 = _mm_unpackhi_ps(o0, o1);                                        \
 1652-		tmp3 = _mm_unpackhi_ps(o2, o3);                                        \
 1653-		o0 = _mm_movelh_ps(tmp0, tmp2);                                        \
 1654-		o1 = _mm_movehl_ps(tmp2, tmp0);                                        \
 1655-		o2 = _mm_movelh_ps(tmp1, tmp3);                                        \
 1656-		o3 = _mm_movehl_ps(tmp3, tmp1);                                        \
 1657-	}
 1658-
 1659-#define stbir__interleave_pack_and_store_16_u8(ptr, r0, r1, r2, r3)            \
 1660-	r0 = _mm_packs_epi32(r0, r1);                                              \
 1661-	r2 = _mm_packs_epi32(r2, r3);                                              \
 1662-	r1 = _mm_unpacklo_epi16(r0, r2);                                           \
 1663-	r3 = _mm_unpackhi_epi16(r0, r2);                                           \
 1664-	r0 = _mm_unpacklo_epi16(r1, r3);                                           \
 1665-	r2 = _mm_unpackhi_epi16(r1, r3);                                           \
 1666-	r0 = _mm_packus_epi16(r0, r2);                                             \
 1667-	stbir__simdi_store(ptr, r0);
 1668-
 1669-#define stbir__simdi_32shr(out, reg, imm) out = _mm_srli_epi32(reg, imm)
 1670-
 1671-#if defined(_MSC_VER) && !defined(__clang__)
 1672-// msvc initializes __m128i with 8-bit values (chars)
 1673-#define STBIR__CONST_32_TO_8(v)                                                \
 1674-	(char)(unsigned char)((v) & 255), (char)(unsigned char)(((v) >> 8) & 255), \
 1675-	    (char)(unsigned char)(((v) >> 16) & 255),                              \
 1676-	    (char)(unsigned char)(((v) >> 24) & 255)
 1677-#define STBIR__CONST_4_32i(v)                                                  \
 1678-	STBIR__CONST_32_TO_8(v), STBIR__CONST_32_TO_8(v), STBIR__CONST_32_TO_8(v), \
 1679-	    STBIR__CONST_32_TO_8(v)
 1680-#define STBIR__CONST_4d_32i(v0, v1, v2, v3)                                    \
 1681-	STBIR__CONST_32_TO_8(v0), STBIR__CONST_32_TO_8(v1),                        \
 1682-	    STBIR__CONST_32_TO_8(v2), STBIR__CONST_32_TO_8(v3)
 1683-#else
 1684-// everything else inits with long longs
 1685-#define STBIR__CONST_4_32i(v)                                                  \
 1686-	(long long)((((stbir_uint64)(stbir_uint32)(v)) << 32) |                    \
 1687-	            ((stbir_uint64)(stbir_uint32)(v))),                            \
 1688-	    (long long)((((stbir_uint64)(stbir_uint32)(v)) << 32) |                \
 1689-	                ((stbir_uint64)(stbir_uint32)(v)))
 1690-#define STBIR__CONST_4d_32i(v0, v1, v2, v3)                                    \
 1691-	(long long)((((stbir_uint64)(stbir_uint32)(v1)) << 32) |                   \
 1692-	            ((stbir_uint64)(stbir_uint32)(v0))),                           \
 1693-	    (long long)((((stbir_uint64)(stbir_uint32)(v3)) << 32) |               \
 1694-	                ((stbir_uint64)(stbir_uint32)(v2)))
 1695-#endif
 1696-
 1697-#define STBIR__SIMDF_CONST(var, x) stbir__simdf var = {x, x, x, x}
 1698-#define STBIR__SIMDI_CONST(var, x) stbir__simdi var = {STBIR__CONST_4_32i(x)}
 1699-#define STBIR__CONSTF(var) (var)
 1700-#define STBIR__CONSTI(var) (var)
 1701-
 1702-#if defined(STBIR_AVX) || defined(__SSE4_1__)
 1703-#include <smmintrin.h>
 1704-#define stbir__simdf_pack_to_8words(out, reg0, reg1)                           \
 1705-	out = _mm_packus_epi32(                                                    \
 1706-	    _mm_cvttps_epi32(_mm_max_ps(                                           \
 1707-	        _mm_min_ps(reg0, STBIR__CONSTF(STBIR_max_uint16_as_float)),        \
 1708-	        _mm_setzero_ps())),                                                \
 1709-	    _mm_cvttps_epi32(_mm_max_ps(                                           \
 1710-	        _mm_min_ps(reg1, STBIR__CONSTF(STBIR_max_uint16_as_float)),        \
 1711-	        _mm_setzero_ps())))
 1712-#else
 1713-static STBIR__SIMDI_CONST(stbir__s32_32768, 32768);
 1714-static STBIR__SIMDI_CONST(stbir__s16_32768, ((32768 << 16) | 32768));
 1715-
 1716-#define stbir__simdf_pack_to_8words(out, reg0, reg1)                           \
 1717-	{                                                                          \
 1718-		stbir__simdi tmp0, tmp1;                                               \
 1719-		tmp0 = _mm_cvttps_epi32(_mm_max_ps(                                    \
 1720-		    _mm_min_ps(reg0, STBIR__CONSTF(STBIR_max_uint16_as_float)),        \
 1721-		    _mm_setzero_ps()));                                                \
 1722-		tmp1 = _mm_cvttps_epi32(_mm_max_ps(                                    \
 1723-		    _mm_min_ps(reg1, STBIR__CONSTF(STBIR_max_uint16_as_float)),        \
 1724-		    _mm_setzero_ps()));                                                \
 1725-		tmp0 = _mm_sub_epi32(tmp0, stbir__s32_32768);                          \
 1726-		tmp1 = _mm_sub_epi32(tmp1, stbir__s32_32768);                          \
 1727-		out = _mm_packs_epi32(tmp0, tmp1);                                     \
 1728-		out = _mm_sub_epi16(out, stbir__s16_32768);                            \
 1729-	}
 1730-
 1731-#endif
 1732-
 1733-#define STBIR_SIMD
 1734-
 1735-// if we detect AVX, set the simd8 defines
 1736-#ifdef STBIR_AVX
 1737-#include <immintrin.h>
 1738-#define STBIR_SIMD8
 1739-#define stbir__simdf8 __m256
 1740-#define stbir__simdi8 __m256i
 1741-#define stbir__simdf8_load(out, ptr)                                           \
 1742-	(out) = _mm256_loadu_ps((float const *)(ptr))
 1743-#define stbir__simdi8_load(out, ptr)                                           \
 1744-	(out) = _mm256_loadu_si256((__m256i const *)(ptr))
 1745-#define stbir__simdf8_mult(out, a, b) (out) = _mm256_mul_ps((a), (b))
 1746-#define stbir__simdf8_store(ptr, out) _mm256_storeu_ps((float *)(ptr), out)
 1747-#define stbir__simdi8_store(ptr, reg) _mm256_storeu_si256((__m256i *)(ptr), reg)
 1748-#define stbir__simdf8_frep8(fval) _mm256_set1_ps(fval)
 1749-
 1750-#define stbir__simdf8_min(out, reg0, reg1) (out) = _mm256_min_ps(reg0, reg1)
 1751-#define stbir__simdf8_max(out, reg0, reg1) (out) = _mm256_max_ps(reg0, reg1)
 1752-
 1753-#define stbir__simdf8_add4halves(out, bot4, top8)                              \
 1754-	(out) = _mm_add_ps(bot4, _mm256_extractf128_ps(top8, 1))
 1755-#define stbir__simdf8_mult_mem(out, reg, ptr)                                  \
 1756-	(out) = _mm256_mul_ps(reg, _mm256_loadu_ps((float const *)(ptr)))
 1757-#define stbir__simdf8_add_mem(out, reg, ptr)                                   \
 1758-	(out) = _mm256_add_ps(reg, _mm256_loadu_ps((float const *)(ptr)))
 1759-#define stbir__simdf8_add(out, a, b) (out) = _mm256_add_ps(a, b)
 1760-#define stbir__simdf8_load1b(out, ptr) (out) = _mm256_broadcast_ss(ptr)
 1761-#define stbir__simdf_load1rep4(out, ptr)                                       \
 1762-	(out) = _mm_broadcast_ss(ptr) // avx load instruction
 1763-
 1764-#define stbir__simdi8_convert_i32_to_float(out, ireg)                          \
 1765-	(out) = _mm256_cvtepi32_ps(ireg)
 1766-#define stbir__simdf8_convert_float_to_i32(i, f) (i) = _mm256_cvttps_epi32(f)
 1767-
 1768-#define stbir__simdf8_bot4s(out, a, b)                                         \
 1769-	(out) = _mm256_permute2f128_ps(a, b, (0 << 0) + (2 << 4))
 1770-#define stbir__simdf8_top4s(out, a, b)                                         \
 1771-	(out) = _mm256_permute2f128_ps(a, b, (1 << 0) + (3 << 4))
 1772-
 1773-#define stbir__simdf8_gettop4(reg) _mm256_extractf128_ps(reg, 1)
 1774-
 1775-#ifdef STBIR_AVX2
 1776-
 1777-#define stbir__simdi8_expand_u8_to_u32(out0, out1, ireg)                       \
 1778-	{                                                                          \
 1779-		stbir__simdi8 a, zero = _mm256_setzero_si256();                        \
 1780-		a = _mm256_permute4x64_epi64(                                          \
 1781-		    _mm256_unpacklo_epi8(                                              \
 1782-		        _mm256_permute4x64_epi64(_mm256_castsi128_si256(ireg),         \
 1783-		                                 (0 << 0) + (2 << 2) + (1 << 4) +      \
 1784-		                                     (3 << 6)),                        \
 1785-		        zero),                                                         \
 1786-		    (0 << 0) + (2 << 2) + (1 << 4) + (3 << 6));                        \
 1787-		out0 = _mm256_unpacklo_epi16(a, zero);                                 \
 1788-		out1 = _mm256_unpackhi_epi16(a, zero);                                 \
 1789-	}
 1790-
 1791-#define stbir__simdf8_pack_to_16bytes(out, aa, bb)                             \
 1792-	{                                                                          \
 1793-		stbir__simdi8 t;                                                       \
 1794-		stbir__simdf8 af, bf;                                                  \
 1795-		stbir__simdi8 a, b;                                                    \
 1796-		af = _mm256_min_ps(aa, STBIR_max_uint8_as_floatX);                     \
 1797-		bf = _mm256_min_ps(bb, STBIR_max_uint8_as_floatX);                     \
 1798-		af = _mm256_max_ps(af, _mm256_setzero_ps());                           \
 1799-		bf = _mm256_max_ps(bf, _mm256_setzero_ps());                           \
 1800-		a = _mm256_cvttps_epi32(af);                                           \
 1801-		b = _mm256_cvttps_epi32(bf);                                           \
 1802-		t = _mm256_permute4x64_epi64(_mm256_packs_epi32(a, b),                 \
 1803-		                             (0 << 0) + (2 << 2) + (1 << 4) +          \
 1804-		                                 (3 << 6));                            \
 1805-		out = _mm256_castsi256_si128(_mm256_permute4x64_epi64(                 \
 1806-		    _mm256_packus_epi16(t, t),                                         \
 1807-		    (0 << 0) + (2 << 2) + (1 << 4) + (3 << 6)));                       \
 1808-	}
 1809-
 1810-#define stbir__simdi8_expand_u16_to_u32(out, ireg)                             \
 1811-	out = _mm256_unpacklo_epi16(                                               \
 1812-	    _mm256_permute4x64_epi64(_mm256_castsi128_si256(ireg),                 \
 1813-	                             (0 << 0) + (2 << 2) + (1 << 4) + (3 << 6)),   \
 1814-	    _mm256_setzero_si256());
 1815-
 1816-#define stbir__simdf8_pack_to_16words(out, aa, bb)                             \
 1817-	{                                                                          \
 1818-		stbir__simdf8 af, bf;                                                  \
 1819-		stbir__simdi8 a, b;                                                    \
 1820-		af = _mm256_min_ps(aa, STBIR_max_uint16_as_floatX);                    \
 1821-		bf = _mm256_min_ps(bb, STBIR_max_uint16_as_floatX);                    \
 1822-		af = _mm256_max_ps(af, _mm256_setzero_ps());                           \
 1823-		bf = _mm256_max_ps(bf, _mm256_setzero_ps());                           \
 1824-		a = _mm256_cvttps_epi32(af);                                           \
 1825-		b = _mm256_cvttps_epi32(bf);                                           \
 1826-		(out) = _mm256_permute4x64_epi64(_mm256_packus_epi32(a, b),            \
 1827-		                                 (0 << 0) + (2 << 2) + (1 << 4) +      \
 1828-		                                     (3 << 6));                        \
 1829-	}
 1830-
 1831-#else
 1832-
 1833-#define stbir__simdi8_expand_u8_to_u32(out0, out1, ireg)                       \
 1834-	{                                                                          \
 1835-		stbir__simdi a, zero = _mm_setzero_si128();                            \
 1836-		a = _mm_unpacklo_epi8(ireg, zero);                                     \
 1837-		out0 = _mm256_setr_m128i(_mm_unpacklo_epi16(a, zero),                  \
 1838-		                         _mm_unpackhi_epi16(a, zero));                 \
 1839-		a = _mm_unpackhi_epi8(ireg, zero);                                     \
 1840-		out1 = _mm256_setr_m128i(_mm_unpacklo_epi16(a, zero),                  \
 1841-		                         _mm_unpackhi_epi16(a, zero));                 \
 1842-	}
 1843-
 1844-#define stbir__simdf8_pack_to_16bytes(out, aa, bb)                             \
 1845-	{                                                                          \
 1846-		stbir__simdi t;                                                        \
 1847-		stbir__simdf8 af, bf;                                                  \
 1848-		stbir__simdi8 a, b;                                                    \
 1849-		af = _mm256_min_ps(aa, STBIR_max_uint8_as_floatX);                     \
 1850-		bf = _mm256_min_ps(bb, STBIR_max_uint8_as_floatX);                     \
 1851-		af = _mm256_max_ps(af, _mm256_setzero_ps());                           \
 1852-		bf = _mm256_max_ps(bf, _mm256_setzero_ps());                           \
 1853-		a = _mm256_cvttps_epi32(af);                                           \
 1854-		b = _mm256_cvttps_epi32(bf);                                           \
 1855-		out = _mm_packs_epi32(_mm256_castsi256_si128(a),                       \
 1856-		                      _mm256_extractf128_si256(a, 1));                 \
 1857-		out = _mm_packus_epi16(out, out);                                      \
 1858-		t = _mm_packs_epi32(_mm256_castsi256_si128(b),                         \
 1859-		                    _mm256_extractf128_si256(b, 1));                   \
 1860-		t = _mm_packus_epi16(t, t);                                            \
 1861-		out = _mm_castps_si128(                                                \
 1862-		    _mm_shuffle_ps(_mm_castsi128_ps(out), _mm_castsi128_ps(t),         \
 1863-		                   (0 << 0) + (1 << 2) + (0 << 4) + (1 << 6)));        \
 1864-	}
 1865-
 1866-#define stbir__simdi8_expand_u16_to_u32(out, ireg)                             \
 1867-	{                                                                          \
 1868-		stbir__simdi a, b, zero = _mm_setzero_si128();                         \
 1869-		a = _mm_unpacklo_epi16(ireg, zero);                                    \
 1870-		b = _mm_unpackhi_epi16(ireg, zero);                                    \
 1871-		out = _mm256_insertf128_si256(_mm256_castsi128_si256(a), b, 1);        \
 1872-	}
 1873-
 1874-#define stbir__simdf8_pack_to_16words(out, aa, bb)                             \
 1875-	{                                                                          \
 1876-		stbir__simdi t0, t1;                                                   \
 1877-		stbir__simdf8 af, bf;                                                  \
 1878-		stbir__simdi8 a, b;                                                    \
 1879-		af = _mm256_min_ps(aa, STBIR_max_uint16_as_floatX);                    \
 1880-		bf = _mm256_min_ps(bb, STBIR_max_uint16_as_floatX);                    \
 1881-		af = _mm256_max_ps(af, _mm256_setzero_ps());                           \
 1882-		bf = _mm256_max_ps(bf, _mm256_setzero_ps());                           \
 1883-		a = _mm256_cvttps_epi32(af);                                           \
 1884-		b = _mm256_cvttps_epi32(bf);                                           \
 1885-		t0 = _mm_packus_epi32(_mm256_castsi256_si128(a),                       \
 1886-		                      _mm256_extractf128_si256(a, 1));                 \
 1887-		t1 = _mm_packus_epi32(_mm256_castsi256_si128(b),                       \
 1888-		                      _mm256_extractf128_si256(b, 1));                 \
 1889-		out = _mm256_setr_m128i(t0, t1);                                       \
 1890-	}
 1891-
 1892-#endif
 1893-
 1894-static __m256i stbir_00001111 = {STBIR__CONST_4d_32i(0, 0, 0, 0),
 1895-                                 STBIR__CONST_4d_32i(1, 1, 1, 1)};
 1896-#define stbir__simdf8_0123to00001111(out, in)                                  \
 1897-	(out) = _mm256_permutevar_ps(in, stbir_00001111)
 1898-
 1899-static __m256i stbir_22223333 = {STBIR__CONST_4d_32i(2, 2, 2, 2),
 1900-                                 STBIR__CONST_4d_32i(3, 3, 3, 3)};
 1901-#define stbir__simdf8_0123to22223333(out, in)                                  \
 1902-	(out) = _mm256_permutevar_ps(in, stbir_22223333)
 1903-
 1904-#define stbir__simdf8_0123to2222(out, in)                                      \
 1905-	(out) = stbir__simdf_swiz(_mm256_castps256_ps128(in), 2, 2, 2, 2)
 1906-
 1907-#define stbir__simdf8_load4b(out, ptr)                                         \
 1908-	(out) = _mm256_broadcast_ps((__m128 const *)(ptr))
 1909-
 1910-static __m256i stbir_00112233 = {STBIR__CONST_4d_32i(0, 0, 1, 1),
 1911-                                 STBIR__CONST_4d_32i(2, 2, 3, 3)};
 1912-#define stbir__simdf8_0123to00112233(out, in)                                  \
 1913-	(out) = _mm256_permutevar_ps(in, stbir_00112233)
 1914-#define stbir__simdf8_add4(out, a8, b)                                         \
 1915-	(out) = _mm256_add_ps(a8, _mm256_castps128_ps256(b))
 1916-
 1917-static __m256i stbir_load6 = {
 1918-    STBIR__CONST_4_32i(0x80000000),
 1919-    STBIR__CONST_4d_32i(0x80000000, 0x80000000, 0, 0)};
 1920-#define stbir__simdf8_load6z(out, ptr)                                         \
 1921-	(out) = _mm256_maskload_ps(ptr, stbir_load6)
 1922-
 1923-#define stbir__simdf8_0123to00000000(out, in)                                  \
 1924-	(out) = _mm256_shuffle_ps(in, in, (0 << 0) + (0 << 2) + (0 << 4) + (0 << 6))
 1925-#define stbir__simdf8_0123to11111111(out, in)                                  \
 1926-	(out) = _mm256_shuffle_ps(in, in, (1 << 0) + (1 << 2) + (1 << 4) + (1 << 6))
 1927-#define stbir__simdf8_0123to22222222(out, in)                                  \
 1928-	(out) = _mm256_shuffle_ps(in, in, (2 << 0) + (2 << 2) + (2 << 4) + (2 << 6))
 1929-#define stbir__simdf8_0123to33333333(out, in)                                  \
 1930-	(out) = _mm256_shuffle_ps(in, in, (3 << 0) + (3 << 2) + (3 << 4) + (3 << 6))
 1931-#define stbir__simdf8_0123to21032103(out, in)                                  \
 1932-	(out) = _mm256_shuffle_ps(in, in, (2 << 0) + (1 << 2) + (0 << 4) + (3 << 6))
 1933-#define stbir__simdf8_0123to32103210(out, in)                                  \
 1934-	(out) = _mm256_shuffle_ps(in, in, (3 << 0) + (2 << 2) + (1 << 4) + (0 << 6))
 1935-#define stbir__simdf8_0123to12301230(out, in)                                  \
 1936-	(out) = _mm256_shuffle_ps(in, in, (1 << 0) + (2 << 2) + (3 << 4) + (0 << 6))
 1937-#define stbir__simdf8_0123to10321032(out, in)                                  \
 1938-	(out) = _mm256_shuffle_ps(in, in, (1 << 0) + (0 << 2) + (3 << 4) + (2 << 6))
 1939-#define stbir__simdf8_0123to30123012(out, in)                                  \
 1940-	(out) = _mm256_shuffle_ps(in, in, (3 << 0) + (0 << 2) + (1 << 4) + (2 << 6))
 1941-
 1942-#define stbir__simdf8_0123to11331133(out, in)                                  \
 1943-	(out) = _mm256_shuffle_ps(in, in, (1 << 0) + (1 << 2) + (3 << 4) + (3 << 6))
 1944-#define stbir__simdf8_0123to00220022(out, in)                                  \
 1945-	(out) = _mm256_shuffle_ps(in, in, (0 << 0) + (0 << 2) + (2 << 4) + (2 << 6))
 1946-
 1947-#define stbir__simdf8_aaa1(out, alp, ones)                                     \
 1948-	(out) = _mm256_blend_ps(alp, ones,                                         \
 1949-	                        (1 << 0) + (1 << 1) + (1 << 2) + (0 << 3) +        \
 1950-	                            (1 << 4) + (1 << 5) + (1 << 6) + (0 << 7));    \
 1951-	(out) =                                                                    \
 1952-	    _mm256_shuffle_ps(out, out, (3 << 0) + (3 << 2) + (3 << 4) + (0 << 6))
 1953-#define stbir__simdf8_1aaa(out, alp, ones)                                     \
 1954-	(out) = _mm256_blend_ps(alp, ones,                                         \
 1955-	                        (0 << 0) + (1 << 1) + (1 << 2) + (1 << 3) +        \
 1956-	                            (0 << 4) + (1 << 5) + (1 << 6) + (1 << 7));    \
 1957-	(out) =                                                                    \
 1958-	    _mm256_shuffle_ps(out, out, (1 << 0) + (0 << 2) + (0 << 4) + (0 << 6))
 1959-#define stbir__simdf8_a1a1(out, alp, ones)                                     \
 1960-	(out) = _mm256_blend_ps(alp, ones,                                         \
 1961-	                        (1 << 0) + (0 << 1) + (1 << 2) + (0 << 3) +        \
 1962-	                            (1 << 4) + (0 << 5) + (1 << 6) + (0 << 7));    \
 1963-	(out) =                                                                    \
 1964-	    _mm256_shuffle_ps(out, out, (1 << 0) + (0 << 2) + (3 << 4) + (2 << 6))
 1965-#define stbir__simdf8_1a1a(out, alp, ones)                                     \
 1966-	(out) = _mm256_blend_ps(alp, ones,                                         \
 1967-	                        (0 << 0) + (1 << 1) + (0 << 2) + (1 << 3) +        \
 1968-	                            (0 << 4) + (1 << 5) + (0 << 6) + (1 << 7));    \
 1969-	(out) =                                                                    \
 1970-	    _mm256_shuffle_ps(out, out, (1 << 0) + (0 << 2) + (3 << 4) + (2 << 6))
 1971-
 1972-#define stbir__simdf8_zero(reg) (reg) = _mm256_setzero_ps()
 1973-
 1974-#ifdef STBIR_USE_FMA // not on by default to maintain bit identical simd to
 1975-                     // non-simd
 1976-#define stbir__simdf8_madd(out, add, mul1, mul2)                               \
 1977-	(out) = _mm256_fmadd_ps(mul1, mul2, add)
 1978-#define stbir__simdf8_madd_mem(out, add, mul, ptr)                             \
 1979-	(out) = _mm256_fmadd_ps(mul, _mm256_loadu_ps((float const *)(ptr)), add)
 1980-#define stbir__simdf8_madd_mem4(out, add, mul, ptr)                            \
 1981-	(out) =                                                                    \
 1982-	    _mm256_fmadd_ps(_mm256_setr_m128(mul, _mm_setzero_ps()),               \
 1983-	                    _mm256_setr_m128(_mm_loadu_ps((float const *)(ptr)),   \
 1984-	                                     _mm_setzero_ps()),                    \
 1985-	                    add)
 1986-#else
 1987-#define stbir__simdf8_madd(out, add, mul1, mul2)                               \
 1988-	(out) = _mm256_add_ps(add, _mm256_mul_ps(mul1, mul2))
 1989-#define stbir__simdf8_madd_mem(out, add, mul, ptr)                             \
 1990-	(out) = _mm256_add_ps(                                                     \
 1991-	    add, _mm256_mul_ps(mul, _mm256_loadu_ps((float const *)(ptr))))
 1992-#define stbir__simdf8_madd_mem4(out, add, mul, ptr)                            \
 1993-	(out) = _mm256_add_ps(                                                     \
 1994-	    add,                                                                   \
 1995-	    _mm256_setr_m128(_mm_mul_ps(mul, _mm_loadu_ps((float const *)(ptr))),  \
 1996-	                     _mm_setzero_ps()))
 1997-#endif
 1998-#define stbir__if_simdf8_cast_to_simdf4(val) _mm256_castps256_ps128(val)
 1999-
 2000-#endif
 2001-
 2002-#ifdef STBIR_FLOORF
 2003-#undef STBIR_FLOORF
 2004-#endif
 2005-#define STBIR_FLOORF stbir_simd_floorf
 2006-static stbir__inline float
 2007-stbir_simd_floorf(float x) // martins floorf
 2008-{
 2009-#if defined(STBIR_AVX) || defined(__SSE4_1__) || defined(STBIR_SSE41)
 2010-	__m128 t = _mm_set_ss(x);
 2011-	return _mm_cvtss_f32(_mm_floor_ss(t, t));
 2012-#else
 2013-	__m128 f = _mm_set_ss(x);
 2014-	__m128 t = _mm_cvtepi32_ps(_mm_cvttps_epi32(f));
 2015-	__m128 r = _mm_add_ss(t, _mm_and_ps(_mm_cmplt_ss(f, t), _mm_set_ss(-1.0f)));
 2016-	return _mm_cvtss_f32(r);
 2017-#endif
 2018-}
 2019-
 2020-#ifdef STBIR_CEILF
 2021-#undef STBIR_CEILF
 2022-#endif
 2023-#define STBIR_CEILF stbir_simd_ceilf
 2024-static stbir__inline float
 2025-stbir_simd_ceilf(float x) // martins ceilf
 2026-{
 2027-#if defined(STBIR_AVX) || defined(__SSE4_1__) || defined(STBIR_SSE41)
 2028-	__m128 t = _mm_set_ss(x);
 2029-	return _mm_cvtss_f32(_mm_ceil_ss(t, t));
 2030-#else
 2031-	__m128 f = _mm_set_ss(x);
 2032-	__m128 t = _mm_cvtepi32_ps(_mm_cvttps_epi32(f));
 2033-	__m128 r = _mm_add_ss(t, _mm_and_ps(_mm_cmplt_ss(t, f), _mm_set_ss(1.0f)));
 2034-	return _mm_cvtss_f32(r);
 2035-#endif
 2036-}
 2037-
 2038-#elif defined(STBIR_NEON)
 2039-
 2040-#include <arm_neon.h>
 2041-
 2042-#define stbir__simdf float32x4_t
 2043-#define stbir__simdi uint32x4_t
 2044-
 2045-#define stbir_simdi_castf(reg) vreinterpretq_u32_f32(reg)
 2046-#define stbir_simdf_casti(reg) vreinterpretq_f32_u32(reg)
 2047-
 2048-#define stbir__simdf_load(reg, ptr) (reg) = vld1q_f32((float const *)(ptr))
 2049-#define stbir__simdi_load(reg, ptr) (reg) = vld1q_u32((uint32_t const *)(ptr))
 2050-#define stbir__simdf_load1(out, ptr)                                           \
 2051-	(out) = vld1q_dup_f32((float const *)(ptr)) // top values can be random (not
 2052-	                                            // denormal or nan for perf)
 2053-#define stbir__simdi_load1(out, ptr)                                           \
 2054-	(out) = vld1q_dup_u32((uint32_t const *)(ptr))
 2055-#define stbir__simdf_load1z(out, ptr)                                          \
 2056-	(out) = vld1q_lane_f32((float const *)(ptr), vdupq_n_f32(0),               \
 2057-	                       0) // top values must be zero
 2058-#define stbir__simdf_frep4(fvar) vdupq_n_f32(fvar)
 2059-#define stbir__simdf_load1frep4(out, fvar) (out) = vdupq_n_f32(fvar)
 2060-#define stbir__simdf_load2(out, ptr)                                           \
 2061-	(out) = vcombine_f32(                                                      \
 2062-	    vld1_f32((float const *)(ptr)),                                        \
 2063-	    vcreate_f32(                                                           \
 2064-	        0)) // top values can be random (not denormal or nan for perf)
 2065-#define stbir__simdf_load2z(out, ptr)                                          \
 2066-	(out) = vcombine_f32(vld1_f32((float const *)(ptr)),                       \
 2067-	                     vcreate_f32(0)) // top values must be zero
 2068-#define stbir__simdf_load2hmerge(out, reg, ptr)                                \
 2069-	(out) = vcombine_f32(vget_low_f32(reg), vld1_f32((float const *)(ptr)))
 2070-
 2071-#define stbir__simdf_zeroP() vdupq_n_f32(0)
 2072-#define stbir__simdf_zero(reg) (reg) = vdupq_n_f32(0)
 2073-
 2074-#define stbir__simdf_store(ptr, reg) vst1q_f32((float *)(ptr), reg)
 2075-#define stbir__simdf_store1(ptr, reg) vst1q_lane_f32((float *)(ptr), reg, 0)
 2076-#define stbir__simdf_store2(ptr, reg)                                          \
 2077-	vst1_f32((float *)(ptr), vget_low_f32(reg))
 2078-#define stbir__simdf_store2h(ptr, reg)                                         \
 2079-	vst1_f32((float *)(ptr), vget_high_f32(reg))
 2080-
 2081-#define stbir__simdi_store(ptr, reg) vst1q_u32((uint32_t *)(ptr), reg)
 2082-#define stbir__simdi_store1(ptr, reg) vst1q_lane_u32((uint32_t *)(ptr), reg, 0)
 2083-#define stbir__simdi_store2(ptr, reg)                                          \
 2084-	vst1_u32((uint32_t *)(ptr), vget_low_u32(reg))
 2085-
 2086-#define stbir__prefetch(ptr)
 2087-
 2088-#define stbir__simdi_expand_u8_to_u32(out0, out1, out2, out3, ireg)            \
 2089-	{                                                                          \
 2090-		uint16x8_t l = vmovl_u8(vget_low_u8(vreinterpretq_u8_u32(ireg)));      \
 2091-		uint16x8_t h = vmovl_u8(vget_high_u8(vreinterpretq_u8_u32(ireg)));     \
 2092-		out0 = vmovl_u16(vget_low_u16(l));                                     \
 2093-		out1 = vmovl_u16(vget_high_u16(l));                                    \
 2094-		out2 = vmovl_u16(vget_low_u16(h));                                     \
 2095-		out3 = vmovl_u16(vget_high_u16(h));                                    \
 2096-	}
 2097-
 2098-#define stbir__simdi_expand_u8_to_1u32(out, ireg)                              \
 2099-	{                                                                          \
 2100-		uint16x8_t tmp = vmovl_u8(vget_low_u8(vreinterpretq_u8_u32(ireg)));    \
 2101-		out = vmovl_u16(vget_low_u16(tmp));                                    \
 2102-	}
 2103-
 2104-#define stbir__simdi_expand_u16_to_u32(out0, out1, ireg)                       \
 2105-	{                                                                          \
 2106-		uint16x8_t tmp = vreinterpretq_u16_u32(ireg);                          \
 2107-		out0 = vmovl_u16(vget_low_u16(tmp));                                   \
 2108-		out1 = vmovl_u16(vget_high_u16(tmp));                                  \
 2109-	}
 2110-
 2111-#define stbir__simdf_convert_float_to_i32(i, f)                                \
 2112-	(i) = vreinterpretq_u32_s32(vcvtq_s32_f32(f))
 2113-#define stbir__simdf_convert_float_to_int(f) vgetq_lane_s32(vcvtq_s32_f32(f), 0)
 2114-#define stbir__simdi_to_int(i) (int)vgetq_lane_u32(i, 0)
 2115-#define stbir__simdf_convert_float_to_uint8(f)                                 \
 2116-	((unsigned char)vgetq_lane_s32(                                            \
 2117-	    vcvtq_s32_f32(                                                         \
 2118-	        vmaxq_f32(vminq_f32(f, STBIR__CONSTF(STBIR_max_uint8_as_float)),   \
 2119-	                  vdupq_n_f32(0))),                                        \
 2120-	    0))
 2121-#define stbir__simdf_convert_float_to_short(f)                                 \
 2122-	((unsigned short)vgetq_lane_s32(                                           \
 2123-	    vcvtq_s32_f32(                                                         \
 2124-	        vmaxq_f32(vminq_f32(f, STBIR__CONSTF(STBIR_max_uint16_as_float)),  \
 2125-	                  vdupq_n_f32(0))),                                        \
 2126-	    0))
 2127-#define stbir__simdi_convert_i32_to_float(out, ireg)                           \
 2128-	(out) = vcvtq_f32_s32(vreinterpretq_s32_u32(ireg))
 2129-#define stbir__simdf_add(out, reg0, reg1) (out) = vaddq_f32(reg0, reg1)
 2130-#define stbir__simdf_mult(out, reg0, reg1) (out) = vmulq_f32(reg0, reg1)
 2131-#define stbir__simdf_mult_mem(out, reg, ptr)                                   \
 2132-	(out) = vmulq_f32(reg, vld1q_f32((float const *)(ptr)))
 2133-#define stbir__simdf_mult1_mem(out, reg, ptr)                                  \
 2134-	(out) = vmulq_f32(reg, vld1q_dup_f32((float const *)(ptr)))
 2135-#define stbir__simdf_add_mem(out, reg, ptr)                                    \
 2136-	(out) = vaddq_f32(reg, vld1q_f32((float const *)(ptr)))
 2137-#define stbir__simdf_add1_mem(out, reg, ptr)                                   \
 2138-	(out) = vaddq_f32(reg, vld1q_dup_f32((float const *)(ptr)))
 2139-
 2140-#ifdef STBIR_USE_FMA // not on by default to maintain bit identical simd to
 2141-                     // non-simd (and also x64 no madd to arm madd)
 2142-#define stbir__simdf_madd(out, add, mul1, mul2)                                \
 2143-	(out) = vfmaq_f32(add, mul1, mul2)
 2144-#define stbir__simdf_madd1(out, add, mul1, mul2)                               \
 2145-	(out) = vfmaq_f32(add, mul1, mul2)
 2146-#define stbir__simdf_madd_mem(out, add, mul, ptr)                              \
 2147-	(out) = vfmaq_f32(add, mul, vld1q_f32((float const *)(ptr)))
 2148-#define stbir__simdf_madd1_mem(out, add, mul, ptr)                             \
 2149-	(out) = vfmaq_f32(add, mul, vld1q_dup_f32((float const *)(ptr)))
 2150-#else
 2151-#define stbir__simdf_madd(out, add, mul1, mul2)                                \
 2152-	(out) = vaddq_f32(add, vmulq_f32(mul1, mul2))
 2153-#define stbir__simdf_madd1(out, add, mul1, mul2)                               \
 2154-	(out) = vaddq_f32(add, vmulq_f32(mul1, mul2))
 2155-#define stbir__simdf_madd_mem(out, add, mul, ptr)                              \
 2156-	(out) = vaddq_f32(add, vmulq_f32(mul, vld1q_f32((float const *)(ptr))))
 2157-#define stbir__simdf_madd1_mem(out, add, mul, ptr)                             \
 2158-	(out) = vaddq_f32(add, vmulq_f32(mul, vld1q_dup_f32((float const *)(ptr))))
 2159-#endif
 2160-
 2161-#define stbir__simdf_add1(out, reg0, reg1) (out) = vaddq_f32(reg0, reg1)
 2162-#define stbir__simdf_mult1(out, reg0, reg1) (out) = vmulq_f32(reg0, reg1)
 2163-
 2164-#define stbir__simdf_and(out, reg0, reg1)                                      \
 2165-	(out) = vreinterpretq_f32_u32(                                             \
 2166-	    vandq_u32(vreinterpretq_u32_f32(reg0), vreinterpretq_u32_f32(reg1)))
 2167-#define stbir__simdf_or(out, reg0, reg1)                                       \
 2168-	(out) = vreinterpretq_f32_u32(                                             \
 2169-	    vorrq_u32(vreinterpretq_u32_f32(reg0), vreinterpretq_u32_f32(reg1)))
 2170-
 2171-#define stbir__simdf_min(out, reg0, reg1) (out) = vminq_f32(reg0, reg1)
 2172-#define stbir__simdf_max(out, reg0, reg1) (out) = vmaxq_f32(reg0, reg1)
 2173-#define stbir__simdf_min1(out, reg0, reg1) (out) = vminq_f32(reg0, reg1)
 2174-#define stbir__simdf_max1(out, reg0, reg1) (out) = vmaxq_f32(reg0, reg1)
 2175-
 2176-#define stbir__simdf_0123ABCDto3ABx(out, reg0, reg1)                           \
 2177-	(out) = vextq_f32(reg0, reg1, 3)
 2178-#define stbir__simdf_0123ABCDto23Ax(out, reg0, reg1)                           \
 2179-	(out) = vextq_f32(reg0, reg1, 2)
 2180-
 2181-#define stbir__simdf_a1a1(out, alp, ones)                                      \
 2182-	(out) = vzipq_f32(vuzpq_f32(alp, alp).val[1], ones).val[0]
 2183-#define stbir__simdf_1a1a(out, alp, ones)                                      \
 2184-	(out) = vzipq_f32(ones, vuzpq_f32(alp, alp).val[0]).val[0]
 2185-
 2186-#if defined(_M_ARM64) || defined(__aarch64__) || defined(__arm64__)
 2187-
 2188-#define stbir__simdf_aaa1(out, alp, ones)                                      \
 2189-	(out) = vcopyq_laneq_f32(vdupq_n_f32(vgetq_lane_f32(alp, 3)), 3, ones, 3)
 2190-#define stbir__simdf_1aaa(out, alp, ones)                                      \
 2191-	(out) = vcopyq_laneq_f32(vdupq_n_f32(vgetq_lane_f32(alp, 0)), 0, ones, 0)
 2192-
 2193-#if defined(_MSC_VER) && !defined(__clang__)
 2194-#define stbir_make16(a, b, c, d)                                               \
 2195-	vcombine_u8(                                                               \
 2196-	    vcreate_u8((4 * a + 0) | ((4 * a + 1) << 8) | ((4 * a + 2) << 16) |    \
 2197-	               ((4 * a + 3) << 24) | ((stbir_uint64)(4 * b + 0) << 32) |   \
 2198-	               ((stbir_uint64)(4 * b + 1) << 40) |                         \
 2199-	               ((stbir_uint64)(4 * b + 2) << 48) |                         \
 2200-	               ((stbir_uint64)(4 * b + 3) << 56)),                         \
 2201-	    vcreate_u8((4 * c + 0) | ((4 * c + 1) << 8) | ((4 * c + 2) << 16) |    \
 2202-	               ((4 * c + 3) << 24) | ((stbir_uint64)(4 * d + 0) << 32) |   \
 2203-	               ((stbir_uint64)(4 * d + 1) << 40) |                         \
 2204-	               ((stbir_uint64)(4 * d + 2) << 48) |                         \
 2205-	               ((stbir_uint64)(4 * d + 3) << 56)))
 2206-
 2207-static stbir__inline uint8x16x2_t
 2208-stbir_make16x2(float32x4_t rega, float32x4_t regb)
 2209-{
 2210-	uint8x16x2_t r = {vreinterpretq_u8_f32(rega), vreinterpretq_u8_f32(regb)};
 2211-	return r;
 2212-}
 2213-#else
 2214-#define stbir_make16(a, b, c, d)                                               \
 2215-	(uint8x16_t){4 * a + 0, 4 * a + 1, 4 * a + 2, 4 * a + 3,                   \
 2216-	             4 * b + 0, 4 * b + 1, 4 * b + 2, 4 * b + 3,                   \
 2217-	             4 * c + 0, 4 * c + 1, 4 * c + 2, 4 * c + 3,                   \
 2218-	             4 * d + 0, 4 * d + 1, 4 * d + 2, 4 * d + 3}
 2219-#define stbir_make16x2(a, b)                                                   \
 2220-	(uint8x16x2_t)                                                             \
 2221-	{                                                                          \
 2222-		{                                                                      \
 2223-			vreinterpretq_u8_f32(a), vreinterpretq_u8_f32(b)                   \
 2224-		}                                                                      \
 2225-	}
 2226-#endif
 2227-
 2228-#define stbir__simdf_swiz(reg, one, two, three, four)                          \
 2229-	vreinterpretq_f32_u8(vqtbl1q_u8(vreinterpretq_u8_f32(reg),                 \
 2230-	                                stbir_make16(one, two, three, four)))
 2231-#define stbir__simdf_swiz2(rega, regb, one, two, three, four)                  \
 2232-	vreinterpretq_f32_u8(vqtbl2q_u8(stbir_make16x2(rega, regb),                \
 2233-	                                stbir_make16(one, two, three, four)))
 2234-
 2235-#define stbir__simdi_16madd(out, reg0, reg1)                                   \
 2236-	{                                                                          \
 2237-		int16x8_t r0 = vreinterpretq_s16_u32(reg0);                            \
 2238-		int16x8_t r1 = vreinterpretq_s16_u32(reg1);                            \
 2239-		int32x4_t tmp0 = vmull_s16(vget_low_s16(r0), vget_low_s16(r1));        \
 2240-		int32x4_t tmp1 = vmull_s16(vget_high_s16(r0), vget_high_s16(r1));      \
 2241-		(out) = vreinterpretq_u32_s32(vpaddq_s32(tmp0, tmp1));                 \
 2242-	}
 2243-
 2244-#else
 2245-
 2246-#define stbir__simdf_aaa1(out, alp, ones)                                      \
 2247-	(out) = vsetq_lane_f32(1.0f, vdupq_n_f32(vgetq_lane_f32(alp, 3)), 3)
 2248-#define stbir__simdf_1aaa(out, alp, ones)                                      \
 2249-	(out) = vsetq_lane_f32(1.0f, vdupq_n_f32(vgetq_lane_f32(alp, 0)), 0)
 2250-
 2251-#if defined(_MSC_VER) && !defined(__clang__)
 2252-static stbir__inline uint8x8x2_t
 2253-stbir_make8x2(float32x4_t reg)
 2254-{
 2255-	uint8x8x2_t r = {{vget_low_u8(vreinterpretq_u8_f32(reg)),
 2256-	                  vget_high_u8(vreinterpretq_u8_f32(reg))}};
 2257-	return r;
 2258-}
 2259-#define stbir_make8(a, b)                                                      \
 2260-	vcreate_u8((4 * a + 0) | ((4 * a + 1) << 8) | ((4 * a + 2) << 16) |        \
 2261-	           ((4 * a + 3) << 24) | ((stbir_uint64)(4 * b + 0) << 32) |       \
 2262-	           ((stbir_uint64)(4 * b + 1) << 40) |                             \
 2263-	           ((stbir_uint64)(4 * b + 2) << 48) |                             \
 2264-	           ((stbir_uint64)(4 * b + 3) << 56))
 2265-#else
 2266-#define stbir_make8x2(reg)                                                     \
 2267-	(uint8x8x2_t)                                                              \
 2268-	{                                                                          \
 2269-		{                                                                      \
 2270-			vget_low_u8(vreinterpretq_u8_f32(reg)),                            \
 2271-			    vget_high_u8(vreinterpretq_u8_f32(reg))                        \
 2272-		}                                                                      \
 2273-	}
 2274-#define stbir_make8(a, b)                                                      \
 2275-	(uint8x8_t){4 * a + 0, 4 * a + 1, 4 * a + 2, 4 * a + 3,                    \
 2276-	            4 * b + 0, 4 * b + 1, 4 * b + 2, 4 * b + 3}
 2277-#endif
 2278-
 2279-#define stbir__simdf_swiz(reg, one, two, three, four)                          \
 2280-	vreinterpretq_f32_u8(                                                      \
 2281-	    vcombine_u8(vtbl2_u8(stbir_make8x2(reg), stbir_make8(one, two)),       \
 2282-	                vtbl2_u8(stbir_make8x2(reg), stbir_make8(three, four))))
 2283-
 2284-#define stbir__simdi_16madd(out, reg0, reg1)                                   \
 2285-	{                                                                          \
 2286-		int16x8_t r0 = vreinterpretq_s16_u32(reg0);                            \
 2287-		int16x8_t r1 = vreinterpretq_s16_u32(reg1);                            \
 2288-		int32x4_t tmp0 = vmull_s16(vget_low_s16(r0), vget_low_s16(r1));        \
 2289-		int32x4_t tmp1 = vmull_s16(vget_high_s16(r0), vget_high_s16(r1));      \
 2290-		int32x2_t out0 = vpadd_s32(vget_low_s32(tmp0), vget_high_s32(tmp0));   \
 2291-		int32x2_t out1 = vpadd_s32(vget_low_s32(tmp1), vget_high_s32(tmp1));   \
 2292-		(out) = vreinterpretq_u32_s32(vcombine_s32(out0, out1));               \
 2293-	}
 2294-
 2295-#endif
 2296-
 2297-#define stbir__simdi_and(out, reg0, reg1) (out) = vandq_u32(reg0, reg1)
 2298-#define stbir__simdi_or(out, reg0, reg1) (out) = vorrq_u32(reg0, reg1)
 2299-
 2300-#define stbir__simdf_pack_to_8bytes(out, aa, bb)                               \
 2301-	{                                                                          \
 2302-		float32x4_t af =                                                       \
 2303-		    vmaxq_f32(vminq_f32(aa, STBIR__CONSTF(STBIR_max_uint8_as_float)),  \
 2304-		              vdupq_n_f32(0));                                         \
 2305-		float32x4_t bf =                                                       \
 2306-		    vmaxq_f32(vminq_f32(bb, STBIR__CONSTF(STBIR_max_uint8_as_float)),  \
 2307-		              vdupq_n_f32(0));                                         \
 2308-		int16x4_t ai = vqmovn_s32(vcvtq_s32_f32(af));                          \
 2309-		int16x4_t bi = vqmovn_s32(vcvtq_s32_f32(bf));                          \
 2310-		uint8x8_t out8 = vqmovun_s16(vcombine_s16(ai, bi));                    \
 2311-		out = vreinterpretq_u32_u8(vcombine_u8(out8, out8));                   \
 2312-	}
 2313-
 2314-#define stbir__simdf_pack_to_8words(out, aa, bb)                               \
 2315-	{                                                                          \
 2316-		float32x4_t af =                                                       \
 2317-		    vmaxq_f32(vminq_f32(aa, STBIR__CONSTF(STBIR_max_uint16_as_float)), \
 2318-		              vdupq_n_f32(0));                                         \
 2319-		float32x4_t bf =                                                       \
 2320-		    vmaxq_f32(vminq_f32(bb, STBIR__CONSTF(STBIR_max_uint16_as_float)), \
 2321-		              vdupq_n_f32(0));                                         \
 2322-		int32x4_t ai = vcvtq_s32_f32(af);                                      \
 2323-		int32x4_t bi = vcvtq_s32_f32(bf);                                      \
 2324-		out = vreinterpretq_u32_u16(                                           \
 2325-		    vcombine_u16(vqmovun_s32(ai), vqmovun_s32(bi)));                   \
 2326-	}
 2327-
 2328-#define stbir__interleave_pack_and_store_16_u8(ptr, r0, r1, r2, r3)            \
 2329-	{                                                                          \
 2330-		int16x4x2_t tmp0 = vzip_s16(vqmovn_s32(vreinterpretq_s32_u32(r0)),     \
 2331-		                            vqmovn_s32(vreinterpretq_s32_u32(r2)));    \
 2332-		int16x4x2_t tmp1 = vzip_s16(vqmovn_s32(vreinterpretq_s32_u32(r1)),     \
 2333-		                            vqmovn_s32(vreinterpretq_s32_u32(r3)));    \
 2334-		uint8x8x2_t out = {{                                                   \
 2335-		    vqmovun_s16(vcombine_s16(tmp0.val[0], tmp0.val[1])),               \
 2336-		    vqmovun_s16(vcombine_s16(tmp1.val[0], tmp1.val[1])),               \
 2337-		}};                                                                    \
 2338-		vst2_u8(ptr, out);                                                     \
 2339-	}
 2340-
 2341-#define stbir__simdf_load4_transposed(o0, o1, o2, o3, ptr)                     \
 2342-	{                                                                          \
 2343-		float32x4x4_t tmp = vld4q_f32(ptr);                                    \
 2344-		o0 = tmp.val[0];                                                       \
 2345-		o1 = tmp.val[1];                                                       \
 2346-		o2 = tmp.val[2];                                                       \
 2347-		o3 = tmp.val[3];                                                       \
 2348-	}
 2349-
 2350-#define stbir__simdi_32shr(out, reg, imm) out = vshrq_n_u32(reg, imm)
 2351-
 2352-#if defined(_MSC_VER) && !defined(__clang__)
 2353-#define STBIR__SIMDF_CONST(var, x)                                             \
 2354-	__declspec(align(8)) float var[] = {x, x, x, x}
 2355-#define STBIR__SIMDI_CONST(var, x)                                             \
 2356-	__declspec(align(8)) uint32_t var[] = {x, x, x, x}
 2357-#define STBIR__CONSTF(var) (*(const float32x4_t *)var)
 2358-#define STBIR__CONSTI(var) (*(const uint32x4_t *)var)
 2359-#else
 2360-#define STBIR__SIMDF_CONST(var, x) stbir__simdf var = {x, x, x, x}
 2361-#define STBIR__SIMDI_CONST(var, x) stbir__simdi var = {x, x, x, x}
 2362-#define STBIR__CONSTF(var) (var)
 2363-#define STBIR__CONSTI(var) (var)
 2364-#endif
 2365-
 2366-#ifdef STBIR_FLOORF
 2367-#undef STBIR_FLOORF
 2368-#endif
 2369-#define STBIR_FLOORF stbir_simd_floorf
 2370-static stbir__inline float
 2371-stbir_simd_floorf(float x)
 2372-{
 2373-#if defined(_M_ARM64) || defined(__aarch64__) || defined(__arm64__)
 2374-	return vget_lane_f32(vrndm_f32(vdup_n_f32(x)), 0);
 2375-#else
 2376-	float32x2_t f = vdup_n_f32(x);
 2377-	float32x2_t t = vcvt_f32_s32(vcvt_s32_f32(f));
 2378-	uint32x2_t a = vclt_f32(f, t);
 2379-	uint32x2_t b = vreinterpret_u32_f32(vdup_n_f32(-1.0f));
 2380-	float32x2_t r = vadd_f32(t, vreinterpret_f32_u32(vand_u32(a, b)));
 2381-	return vget_lane_f32(r, 0);
 2382-#endif
 2383-}
 2384-
 2385-#ifdef STBIR_CEILF
 2386-#undef STBIR_CEILF
 2387-#endif
 2388-#define STBIR_CEILF stbir_simd_ceilf
 2389-static stbir__inline float
 2390-stbir_simd_ceilf(float x)
 2391-{
 2392-#if defined(_M_ARM64) || defined(__aarch64__) || defined(__arm64__)
 2393-	return vget_lane_f32(vrndp_f32(vdup_n_f32(x)), 0);
 2394-#else
 2395-	float32x2_t f = vdup_n_f32(x);
 2396-	float32x2_t t = vcvt_f32_s32(vcvt_s32_f32(f));
 2397-	uint32x2_t a = vclt_f32(t, f);
 2398-	uint32x2_t b = vreinterpret_u32_f32(vdup_n_f32(1.0f));
 2399-	float32x2_t r = vadd_f32(t, vreinterpret_f32_u32(vand_u32(a, b)));
 2400-	return vget_lane_f32(r, 0);
 2401-#endif
 2402-}
 2403-
 2404-#define STBIR_SIMD
 2405-
 2406-#elif defined(STBIR_WASM)
 2407-
 2408-#include <wasm_simd128.h>
 2409-
 2410-#define stbir__simdf v128_t
 2411-#define stbir__simdi v128_t
 2412-
 2413-#define stbir_simdi_castf(reg) (reg)
 2414-#define stbir_simdf_casti(reg) (reg)
 2415-
 2416-#define stbir__simdf_load(reg, ptr) (reg) = wasm_v128_load((void const *)(ptr))
 2417-#define stbir__simdi_load(reg, ptr) (reg) = wasm_v128_load((void const *)(ptr))
 2418-#define stbir__simdf_load1(out, ptr)                                           \
 2419-	(out) = wasm_v128_load32_splat(                                            \
 2420-	    (void const *)(ptr)) // top values can be random (not denormal or nan
 2421-	                         // for perf)
 2422-#define stbir__simdi_load1(out, ptr)                                           \
 2423-	(out) = wasm_v128_load32_splat((void const *)(ptr))
 2424-#define stbir__simdf_load1z(out, ptr)                                          \
 2425-	(out) =                                                                    \
 2426-	    wasm_v128_load32_zero((void const *)(ptr)) // top values must be zero
 2427-#define stbir__simdf_frep4(fvar) wasm_f32x4_splat(fvar)
 2428-#define stbir__simdf_load1frep4(out, fvar) (out) = wasm_f32x4_splat(fvar)
 2429-#define stbir__simdf_load2(out, ptr)                                           \
 2430-	(out) = wasm_v128_load64_splat(                                            \
 2431-	    (void const *)(ptr)) // top values can be random (not denormal or nan
 2432-	                         // for perf)
 2433-#define stbir__simdf_load2z(out, ptr)                                          \
 2434-	(out) =                                                                    \
 2435-	    wasm_v128_load64_zero((void const *)(ptr)) // top values must be zero
 2436-#define stbir__simdf_load2hmerge(out, reg, ptr)                                \
 2437-	(out) = wasm_v128_load64_lane((void const *)(ptr), reg, 1)
 2438-
 2439-#define stbir__simdf_zeroP() wasm_f32x4_const_splat(0)
 2440-#define stbir__simdf_zero(reg) (reg) = wasm_f32x4_const_splat(0)
 2441-
 2442-#define stbir__simdf_store(ptr, reg) wasm_v128_store((void *)(ptr), reg)
 2443-#define stbir__simdf_store1(ptr, reg)                                          \
 2444-	wasm_v128_store32_lane((void *)(ptr), reg, 0)
 2445-#define stbir__simdf_store2(ptr, reg)                                          \
 2446-	wasm_v128_store64_lane((void *)(ptr), reg, 0)
 2447-#define stbir__simdf_store2h(ptr, reg)                                         \
 2448-	wasm_v128_store64_lane((void *)(ptr), reg, 1)
 2449-
 2450-#define stbir__simdi_store(ptr, reg) wasm_v128_store((void *)(ptr), reg)
 2451-#define stbir__simdi_store1(ptr, reg)                                          \
 2452-	wasm_v128_store32_lane((void *)(ptr), reg, 0)
 2453-#define stbir__simdi_store2(ptr, reg)                                          \
 2454-	wasm_v128_store64_lane((void *)(ptr), reg, 0)
 2455-
 2456-#define stbir__prefetch(ptr)
 2457-
 2458-#define stbir__simdi_expand_u8_to_u32(out0, out1, out2, out3, ireg)            \
 2459-	{                                                                          \
 2460-		v128_t l = wasm_u16x8_extend_low_u8x16(ireg);                          \
 2461-		v128_t h = wasm_u16x8_extend_high_u8x16(ireg);                         \
 2462-		out0 = wasm_u32x4_extend_low_u16x8(l);                                 \
 2463-		out1 = wasm_u32x4_extend_high_u16x8(l);                                \
 2464-		out2 = wasm_u32x4_extend_low_u16x8(h);                                 \
 2465-		out3 = wasm_u32x4_extend_high_u16x8(h);                                \
 2466-	}
 2467-
 2468-#define stbir__simdi_expand_u8_to_1u32(out, ireg)                              \
 2469-	{                                                                          \
 2470-		v128_t tmp = wasm_u16x8_extend_low_u8x16(ireg);                        \
 2471-		out = wasm_u32x4_extend_low_u16x8(tmp);                                \
 2472-	}
 2473-
 2474-#define stbir__simdi_expand_u16_to_u32(out0, out1, ireg)                       \
 2475-	{                                                                          \
 2476-		out0 = wasm_u32x4_extend_low_u16x8(ireg);                              \
 2477-		out1 = wasm_u32x4_extend_high_u16x8(ireg);                             \
 2478-	}
 2479-
 2480-#define stbir__simdf_convert_float_to_i32(i, f)                                \
 2481-	(i) = wasm_i32x4_trunc_sat_f32x4(f)
 2482-#define stbir__simdf_convert_float_to_int(f)                                   \
 2483-	wasm_i32x4_extract_lane(wasm_i32x4_trunc_sat_f32x4(f), 0)
 2484-#define stbir__simdi_to_int(i) wasm_i32x4_extract_lane(i, 0)
 2485-#define stbir__simdf_convert_float_to_uint8(f)                                 \
 2486-	((unsigned char)wasm_i32x4_extract_lane(                                   \
 2487-	    wasm_i32x4_trunc_sat_f32x4(                                            \
 2488-	        wasm_f32x4_max(wasm_f32x4_min(f, STBIR_max_uint8_as_float),        \
 2489-	                       wasm_f32x4_const_splat(0))),                        \
 2490-	    0))
 2491-#define stbir__simdf_convert_float_to_short(f)                                 \
 2492-	((unsigned short)wasm_i32x4_extract_lane(                                  \
 2493-	    wasm_i32x4_trunc_sat_f32x4(                                            \
 2494-	        wasm_f32x4_max(wasm_f32x4_min(f, STBIR_max_uint16_as_float),       \
 2495-	                       wasm_f32x4_const_splat(0))),                        \
 2496-	    0))
 2497-#define stbir__simdi_convert_i32_to_float(out, ireg)                           \
 2498-	(out) = wasm_f32x4_convert_i32x4(ireg)
 2499-#define stbir__simdf_add(out, reg0, reg1) (out) = wasm_f32x4_add(reg0, reg1)
 2500-#define stbir__simdf_mult(out, reg0, reg1) (out) = wasm_f32x4_mul(reg0, reg1)
 2501-#define stbir__simdf_mult_mem(out, reg, ptr)                                   \
 2502-	(out) = wasm_f32x4_mul(reg, wasm_v128_load((void const *)(ptr)))
 2503-#define stbir__simdf_mult1_mem(out, reg, ptr)                                  \
 2504-	(out) = wasm_f32x4_mul(reg, wasm_v128_load32_splat((void const *)(ptr)))
 2505-#define stbir__simdf_add_mem(out, reg, ptr)                                    \
 2506-	(out) = wasm_f32x4_add(reg, wasm_v128_load((void const *)(ptr)))
 2507-#define stbir__simdf_add1_mem(out, reg, ptr)                                   \
 2508-	(out) = wasm_f32x4_add(reg, wasm_v128_load32_splat((void const *)(ptr)))
 2509-
 2510-#define stbir__simdf_madd(out, add, mul1, mul2)                                \
 2511-	(out) = wasm_f32x4_add(add, wasm_f32x4_mul(mul1, mul2))
 2512-#define stbir__simdf_madd1(out, add, mul1, mul2)                               \
 2513-	(out) = wasm_f32x4_add(add, wasm_f32x4_mul(mul1, mul2))
 2514-#define stbir__simdf_madd_mem(out, add, mul, ptr)                              \
 2515-	(out) = wasm_f32x4_add(                                                    \
 2516-	    add, wasm_f32x4_mul(mul, wasm_v128_load((void const *)(ptr))))
 2517-#define stbir__simdf_madd1_mem(out, add, mul, ptr)                             \
 2518-	(out) = wasm_f32x4_add(                                                    \
 2519-	    add, wasm_f32x4_mul(mul, wasm_v128_load32_splat((void const *)(ptr))))
 2520-
 2521-#define stbir__simdf_add1(out, reg0, reg1) (out) = wasm_f32x4_add(reg0, reg1)
 2522-#define stbir__simdf_mult1(out, reg0, reg1) (out) = wasm_f32x4_mul(reg0, reg1)
 2523-
 2524-#define stbir__simdf_and(out, reg0, reg1) (out) = wasm_v128_and(reg0, reg1)
 2525-#define stbir__simdf_or(out, reg0, reg1) (out) = wasm_v128_or(reg0, reg1)
 2526-
 2527-#define stbir__simdf_min(out, reg0, reg1) (out) = wasm_f32x4_min(reg0, reg1)
 2528-#define stbir__simdf_max(out, reg0, reg1) (out) = wasm_f32x4_max(reg0, reg1)
 2529-#define stbir__simdf_min1(out, reg0, reg1) (out) = wasm_f32x4_min(reg0, reg1)
 2530-#define stbir__simdf_max1(out, reg0, reg1) (out) = wasm_f32x4_max(reg0, reg1)
 2531-
 2532-#define stbir__simdf_0123ABCDto3ABx(out, reg0, reg1)                           \
 2533-	(out) = wasm_i32x4_shuffle(reg0, reg1, 3, 4, 5, -1)
 2534-#define stbir__simdf_0123ABCDto23Ax(out, reg0, reg1)                           \
 2535-	(out) = wasm_i32x4_shuffle(reg0, reg1, 2, 3, 4, -1)
 2536-
 2537-#define stbir__simdf_aaa1(out, alp, ones)                                      \
 2538-	(out) = wasm_i32x4_shuffle(alp, ones, 3, 3, 3, 4)
 2539-#define stbir__simdf_1aaa(out, alp, ones)                                      \
 2540-	(out) = wasm_i32x4_shuffle(alp, ones, 4, 0, 0, 0)
 2541-#define stbir__simdf_a1a1(out, alp, ones)                                      \
 2542-	(out) = wasm_i32x4_shuffle(alp, ones, 1, 4, 3, 4)
 2543-#define stbir__simdf_1a1a(out, alp, ones)                                      \
 2544-	(out) = wasm_i32x4_shuffle(alp, ones, 4, 0, 4, 2)
 2545-
 2546-#define stbir__simdf_swiz(reg, one, two, three, four)                          \
 2547-	wasm_i32x4_shuffle(reg, reg, one, two, three, four)
 2548-
 2549-#define stbir__simdi_and(out, reg0, reg1) (out) = wasm_v128_and(reg0, reg1)
 2550-#define stbir__simdi_or(out, reg0, reg1) (out) = wasm_v128_or(reg0, reg1)
 2551-#define stbir__simdi_16madd(out, reg0, reg1)                                   \
 2552-	(out) = wasm_i32x4_dot_i16x8(reg0, reg1)
 2553-
 2554-#define stbir__simdf_pack_to_8bytes(out, aa, bb)                               \
 2555-	{                                                                          \
 2556-		v128_t af =                                                            \
 2557-		    wasm_f32x4_max(wasm_f32x4_min(aa, STBIR_max_uint8_as_float),       \
 2558-		                   wasm_f32x4_const_splat(0));                         \
 2559-		v128_t bf =                                                            \
 2560-		    wasm_f32x4_max(wasm_f32x4_min(bb, STBIR_max_uint8_as_float),       \
 2561-		                   wasm_f32x4_const_splat(0));                         \
 2562-		v128_t ai = wasm_i32x4_trunc_sat_f32x4(af);                            \
 2563-		v128_t bi = wasm_i32x4_trunc_sat_f32x4(bf);                            \
 2564-		v128_t out16 = wasm_i16x8_narrow_i32x4(ai, bi);                        \
 2565-		out = wasm_u8x16_narrow_i16x8(out16, out16);                           \
 2566-	}
 2567-
 2568-#define stbir__simdf_pack_to_8words(out, aa, bb)                               \
 2569-	{                                                                          \
 2570-		v128_t af =                                                            \
 2571-		    wasm_f32x4_max(wasm_f32x4_min(aa, STBIR_max_uint16_as_float),      \
 2572-		                   wasm_f32x4_const_splat(0));                         \
 2573-		v128_t bf =                                                            \
 2574-		    wasm_f32x4_max(wasm_f32x4_min(bb, STBIR_max_uint16_as_float),      \
 2575-		                   wasm_f32x4_const_splat(0));                         \
 2576-		v128_t ai = wasm_i32x4_trunc_sat_f32x4(af);                            \
 2577-		v128_t bi = wasm_i32x4_trunc_sat_f32x4(bf);                            \
 2578-		out = wasm_u16x8_narrow_i32x4(ai, bi);                                 \
 2579-	}
 2580-
 2581-#define stbir__interleave_pack_and_store_16_u8(ptr, r0, r1, r2, r3)            \
 2582-	{                                                                          \
 2583-		v128_t tmp0 = wasm_i16x8_narrow_i32x4(r0, r1);                         \
 2584-		v128_t tmp1 = wasm_i16x8_narrow_i32x4(r2, r3);                         \
 2585-		v128_t tmp = wasm_u8x16_narrow_i16x8(tmp0, tmp1);                      \
 2586-		tmp = wasm_i8x16_shuffle(tmp, tmp, 0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, \
 2587-		                         14, 3, 7, 11, 15);                            \
 2588-		wasm_v128_store((void *)(ptr), tmp);                                   \
 2589-	}
 2590-
 2591-#define stbir__simdf_load4_transposed(o0, o1, o2, o3, ptr)                     \
 2592-	{                                                                          \
 2593-		v128_t t0 = wasm_v128_load(ptr);                                       \
 2594-		v128_t t1 = wasm_v128_load(ptr + 4);                                   \
 2595-		v128_t t2 = wasm_v128_load(ptr + 8);                                   \
 2596-		v128_t t3 = wasm_v128_load(ptr + 12);                                  \
 2597-		v128_t s0 = wasm_i32x4_shuffle(t0, t1, 0, 4, 2, 6);                    \
 2598-		v128_t s1 = wasm_i32x4_shuffle(t0, t1, 1, 5, 3, 7);                    \
 2599-		v128_t s2 = wasm_i32x4_shuffle(t2, t3, 0, 4, 2, 6);                    \
 2600-		v128_t s3 = wasm_i32x4_shuffle(t2, t3, 1, 5, 3, 7);                    \
 2601-		o0 = wasm_i32x4_shuffle(s0, s2, 0, 1, 4, 5);                           \
 2602-		o1 = wasm_i32x4_shuffle(s1, s3, 0, 1, 4, 5);                           \
 2603-		o2 = wasm_i32x4_shuffle(s0, s2, 2, 3, 6, 7);                           \
 2604-		o3 = wasm_i32x4_shuffle(s1, s3, 2, 3, 6, 7);                           \
 2605-	}
 2606-
 2607-#define stbir__simdi_32shr(out, reg, imm) out = wasm_u32x4_shr(reg, imm)
 2608-
 2609-typedef float stbir__f32x4
 2610-    __attribute__((__vector_size__(16), __aligned__(16)));
 2611-#define STBIR__SIMDF_CONST(var, x)                                             \
 2612-	stbir__simdf var = (v128_t)(stbir__f32x4) { x, x, x, x }
 2613-#define STBIR__SIMDI_CONST(var, x) stbir__simdi var = {x, x, x, x}
 2614-#define STBIR__CONSTF(var) (var)
 2615-#define STBIR__CONSTI(var) (var)
 2616-
 2617-#ifdef STBIR_FLOORF
 2618-#undef STBIR_FLOORF
 2619-#endif
 2620-#define STBIR_FLOORF stbir_simd_floorf
 2621-static stbir__inline float
 2622-stbir_simd_floorf(float x)
 2623-{
 2624-	return wasm_f32x4_extract_lane(wasm_f32x4_floor(wasm_f32x4_splat(x)), 0);
 2625-}
 2626-
 2627-#ifdef STBIR_CEILF
 2628-#undef STBIR_CEILF
 2629-#endif
 2630-#define STBIR_CEILF stbir_simd_ceilf
 2631-static stbir__inline float
 2632-stbir_simd_ceilf(float x)
 2633-{
 2634-	return wasm_f32x4_extract_lane(wasm_f32x4_ceil(wasm_f32x4_splat(x)), 0);
 2635-}
 2636-
 2637-#define STBIR_SIMD
 2638-
 2639-#endif // SSE2/NEON/WASM
 2640-
 2641-#endif // NO SIMD
 2642-
 2643-#ifdef STBIR_SIMD8
 2644-#define stbir__simdfX stbir__simdf8
 2645-#define stbir__simdiX stbir__simdi8
 2646-#define stbir__simdfX_load stbir__simdf8_load
 2647-#define stbir__simdiX_load stbir__simdi8_load
 2648-#define stbir__simdfX_mult stbir__simdf8_mult
 2649-#define stbir__simdfX_add_mem stbir__simdf8_add_mem
 2650-#define stbir__simdfX_madd_mem stbir__simdf8_madd_mem
 2651-#define stbir__simdfX_store stbir__simdf8_store
 2652-#define stbir__simdiX_store stbir__simdi8_store
 2653-#define stbir__simdf_frepX stbir__simdf8_frep8
 2654-#define stbir__simdfX_madd stbir__simdf8_madd
 2655-#define stbir__simdfX_min stbir__simdf8_min
 2656-#define stbir__simdfX_max stbir__simdf8_max
 2657-#define stbir__simdfX_aaa1 stbir__simdf8_aaa1
 2658-#define stbir__simdfX_1aaa stbir__simdf8_1aaa
 2659-#define stbir__simdfX_a1a1 stbir__simdf8_a1a1
 2660-#define stbir__simdfX_1a1a stbir__simdf8_1a1a
 2661-#define stbir__simdfX_convert_float_to_i32 stbir__simdf8_convert_float_to_i32
 2662-#define stbir__simdfX_pack_to_words stbir__simdf8_pack_to_16words
 2663-#define stbir__simdfX_zero stbir__simdf8_zero
 2664-#define STBIR_onesX STBIR_ones8
 2665-#define STBIR_max_uint8_as_floatX STBIR_max_uint8_as_float8
 2666-#define STBIR_max_uint16_as_floatX STBIR_max_uint16_as_float8
 2667-#define STBIR_simd_point5X STBIR_simd_point58
 2668-#define stbir__simdfX_float_count 8
 2669-#define stbir__simdfX_0123to1230 stbir__simdf8_0123to12301230
 2670-#define stbir__simdfX_0123to2103 stbir__simdf8_0123to21032103
 2671-static const stbir__simdf8 STBIR_max_uint16_as_float_inverted8 = {
 2672-    stbir__max_uint16_as_float_inverted, stbir__max_uint16_as_float_inverted,
 2673-    stbir__max_uint16_as_float_inverted, stbir__max_uint16_as_float_inverted,
 2674-    stbir__max_uint16_as_float_inverted, stbir__max_uint16_as_float_inverted,
 2675-    stbir__max_uint16_as_float_inverted, stbir__max_uint16_as_float_inverted};
 2676-static const stbir__simdf8 STBIR_max_uint8_as_float_inverted8 = {
 2677-    stbir__max_uint8_as_float_inverted, stbir__max_uint8_as_float_inverted,
 2678-    stbir__max_uint8_as_float_inverted, stbir__max_uint8_as_float_inverted,
 2679-    stbir__max_uint8_as_float_inverted, stbir__max_uint8_as_float_inverted,
 2680-    stbir__max_uint8_as_float_inverted, stbir__max_uint8_as_float_inverted};
 2681-static const stbir__simdf8 STBIR_ones8 = {1.0, 1.0, 1.0, 1.0,
 2682-                                          1.0, 1.0, 1.0, 1.0};
 2683-static const stbir__simdf8 STBIR_simd_point58 = {0.5, 0.5, 0.5, 0.5,
 2684-                                                 0.5, 0.5, 0.5, 0.5};
 2685-static const stbir__simdf8 STBIR_max_uint8_as_float8 = {
 2686-    stbir__max_uint8_as_float, stbir__max_uint8_as_float,
 2687-    stbir__max_uint8_as_float, stbir__max_uint8_as_float,
 2688-    stbir__max_uint8_as_float, stbir__max_uint8_as_float,
 2689-    stbir__max_uint8_as_float, stbir__max_uint8_as_float};
 2690-static const stbir__simdf8 STBIR_max_uint16_as_float8 = {
 2691-    stbir__max_uint16_as_float, stbir__max_uint16_as_float,
 2692-    stbir__max_uint16_as_float, stbir__max_uint16_as_float,
 2693-    stbir__max_uint16_as_float, stbir__max_uint16_as_float,
 2694-    stbir__max_uint16_as_float, stbir__max_uint16_as_float};
 2695-#else
 2696-#define stbir__simdfX stbir__simdf
 2697-#define stbir__simdiX stbir__simdi
 2698-#define stbir__simdfX_load stbir__simdf_load
 2699-#define stbir__simdiX_load stbir__simdi_load
 2700-#define stbir__simdfX_mult stbir__simdf_mult
 2701-#define stbir__simdfX_add_mem stbir__simdf_add_mem
 2702-#define stbir__simdfX_madd_mem stbir__simdf_madd_mem
 2703-#define stbir__simdfX_store stbir__simdf_store
 2704-#define stbir__simdiX_store stbir__simdi_store
 2705-#define stbir__simdf_frepX stbir__simdf_frep4
 2706-#define stbir__simdfX_madd stbir__simdf_madd
 2707-#define stbir__simdfX_min stbir__simdf_min
 2708-#define stbir__simdfX_max stbir__simdf_max
 2709-#define stbir__simdfX_aaa1 stbir__simdf_aaa1
 2710-#define stbir__simdfX_1aaa stbir__simdf_1aaa
 2711-#define stbir__simdfX_a1a1 stbir__simdf_a1a1
 2712-#define stbir__simdfX_1a1a stbir__simdf_1a1a
 2713-#define stbir__simdfX_convert_float_to_i32 stbir__simdf_convert_float_to_i32
 2714-#define stbir__simdfX_pack_to_words stbir__simdf_pack_to_8words
 2715-#define stbir__simdfX_zero stbir__simdf_zero
 2716-#define STBIR_onesX STBIR__CONSTF(STBIR_ones)
 2717-#define STBIR_simd_point5X STBIR__CONSTF(STBIR_simd_point5)
 2718-#define STBIR_max_uint8_as_floatX STBIR__CONSTF(STBIR_max_uint8_as_float)
 2719-#define STBIR_max_uint16_as_floatX STBIR__CONSTF(STBIR_max_uint16_as_float)
 2720-#define stbir__simdfX_float_count 4
 2721-#define stbir__if_simdf8_cast_to_simdf4(val) (val)
 2722-#define stbir__simdfX_0123to1230 stbir__simdf_0123to1230
 2723-#define stbir__simdfX_0123to2103 stbir__simdf_0123to2103
 2724-#endif
 2725-
 2726-#if defined(STBIR_NEON) && !defined(_M_ARM) && !defined(__arm__)
 2727-
 2728-#if defined(_MSC_VER) && !defined(__clang__)
 2729-typedef __int16 stbir__FP16;
 2730-#else
 2731-typedef float16_t stbir__FP16;
 2732-#endif
 2733-
 2734-#else // no NEON, or 32-bit ARM for MSVC
 2735-
 2736-typedef union stbir__FP16 {
 2737-	unsigned short u;
 2738-} stbir__FP16;
 2739-
 2740-#endif
 2741-
 2742-#if (!defined(STBIR_NEON) && !defined(STBIR_FP16C)) ||                         \
 2743-    (defined(STBIR_NEON) && defined(_M_ARM)) ||                                \
 2744-    (defined(STBIR_NEON) && defined(__arm__))
 2745-
 2746-// Fabian's half float routines, see: https://gist.github.com/rygorous/2156668
 2747-
 2748-static stbir__inline float
 2749-stbir__half_to_float(stbir__FP16 h)
 2750-{
 2751-	static const stbir__FP32 magic = {(254 - 15) << 23};
 2752-	static const stbir__FP32 was_infnan = {(127 + 16) << 23};
 2753-	stbir__FP32 o;
 2754-
 2755-	o.u = (h.u & 0x7fff) << 13; // exponent/mantissa bits
 2756-	o.f *= magic.f;             // exponent adjust
 2757-	if (o.f >= was_infnan.f) {  // make sure Inf/NaN survive
 2758-		o.u |= 255 << 23;
 2759-	}
 2760-	o.u |= (h.u & 0x8000) << 16; // sign bit
 2761-	return o.f;
 2762-}
 2763-
 2764-static stbir__inline stbir__FP16
 2765-stbir__float_to_half(float val)
 2766-{
 2767-	stbir__FP32 f32infty = {255 << 23};
 2768-	stbir__FP32 f16max = {(127 + 16) << 23};
 2769-	stbir__FP32 denorm_magic = {((127 - 15) + (23 - 10) + 1) << 23};
 2770-	unsigned int sign_mask = 0x80000000u;
 2771-	stbir__FP16 o = {0};
 2772-	stbir__FP32 f;
 2773-	unsigned int sign;
 2774-
 2775-	f.f = val;
 2776-	sign = f.u & sign_mask;
 2777-	f.u ^= sign;
 2778-
 2779-	if (f.u >= f16max.u) { // result is Inf or NaN (all exponent bits set)
 2780-		o.u = (f.u > f32infty.u) ? 0x7e00 : 0x7c00; // NaN->qNaN and Inf->Inf
 2781-	} else // (De)normalized number or zero
 2782-	{
 2783-		if (f.u < (113 << 23)) // resulting FP16 is subnormal or zero
 2784-		{
 2785-			// use a magic value to align our 10 mantissa bits at the bottom of
 2786-			// the float. as long as FP addition is round-to-nearest-even this
 2787-			// just works.
 2788-			f.f += denorm_magic.f;
 2789-			// and one integer subtract of the bias later, we have our final
 2790-			// float!
 2791-			o.u = (unsigned short)(f.u - denorm_magic.u);
 2792-		} else {
 2793-			unsigned int mant_odd =
 2794-			    (f.u >> 13) & 1; // resulting mantissa is odd
 2795-			// update exponent, rounding bias part 1
 2796-			f.u = f.u + ((15u - 127) << 23) + 0xfff;
 2797-			// rounding bias part 2
 2798-			f.u += mant_odd;
 2799-			// take the bits!
 2800-			o.u = (unsigned short)(f.u >> 13);
 2801-		}
 2802-	}
 2803-
 2804-	o.u |= sign >> 16;
 2805-	return o;
 2806-}
 2807-
 2808-#endif
 2809-
 2810-#if defined(STBIR_FP16C)
 2811-
 2812-#include <immintrin.h>
 2813-
 2814-static stbir__inline void
 2815-stbir__half_to_float_SIMD(float *output, stbir__FP16 const *input)
 2816-{
 2817-	_mm256_storeu_ps((float *)output,
 2818-	                 _mm256_cvtph_ps(_mm_loadu_si128((__m128i const *)input)));
 2819-}
 2820-
 2821-static stbir__inline void
 2822-stbir__float_to_half_SIMD(stbir__FP16 *output, float const *input)
 2823-{
 2824-	_mm_storeu_si128((__m128i *)output,
 2825-	                 _mm256_cvtps_ph(_mm256_loadu_ps(input), 0));
 2826-}
 2827-
 2828-static stbir__inline float
 2829-stbir__half_to_float(stbir__FP16 h)
 2830-{
 2831-	return _mm_cvtss_f32(_mm_cvtph_ps(_mm_cvtsi32_si128((int)h.u)));
 2832-}
 2833-
 2834-static stbir__inline stbir__FP16
 2835-stbir__float_to_half(float f)
 2836-{
 2837-	stbir__FP16 h;
 2838-	h.u = (unsigned short)_mm_cvtsi128_si32(_mm_cvtps_ph(_mm_set_ss(f), 0));
 2839-	return h;
 2840-}
 2841-
 2842-#elif defined(STBIR_SSE2)
 2843-
 2844-// Fabian's half float routines, see: https://gist.github.com/rygorous/2156668
 2845-stbir__inline static void
 2846-stbir__half_to_float_SIMD(float *output, void const *input)
 2847-{
 2848-	static const STBIR__SIMDI_CONST(mask_nosign, 0x7fff);
 2849-	static const STBIR__SIMDI_CONST(smallest_normal, 0x0400);
 2850-	static const STBIR__SIMDI_CONST(infinity, 0x7c00);
 2851-	static const STBIR__SIMDI_CONST(expadjust_normal, (127 - 15) << 23);
 2852-	static const STBIR__SIMDI_CONST(magic_denorm, 113 << 23);
 2853-
 2854-	__m128i i = _mm_loadu_si128((__m128i const *)(input));
 2855-	__m128i h = _mm_unpacklo_epi16(i, _mm_setzero_si128());
 2856-	__m128i mnosign = STBIR__CONSTI(mask_nosign);
 2857-	__m128i eadjust = STBIR__CONSTI(expadjust_normal);
 2858-	__m128i smallest = STBIR__CONSTI(smallest_normal);
 2859-	__m128i infty = STBIR__CONSTI(infinity);
 2860-	__m128i expmant = _mm_and_si128(mnosign, h);
 2861-	__m128i justsign = _mm_xor_si128(h, expmant);
 2862-	__m128i b_notinfnan = _mm_cmpgt_epi32(infty, expmant);
 2863-	__m128i b_isdenorm = _mm_cmpgt_epi32(smallest, expmant);
 2864-	__m128i shifted = _mm_slli_epi32(expmant, 13);
 2865-	__m128i adj_infnan = _mm_andnot_si128(b_notinfnan, eadjust);
 2866-	__m128i adjusted = _mm_add_epi32(eadjust, shifted);
 2867-	__m128i den1 = _mm_add_epi32(shifted, STBIR__CONSTI(magic_denorm));
 2868-	__m128i adjusted2 = _mm_add_epi32(adjusted, adj_infnan);
 2869-	__m128 den2 =
 2870-	    _mm_sub_ps(_mm_castsi128_ps(den1), *(const __m128 *)&magic_denorm);
 2871-	__m128 adjusted3 = _mm_and_ps(den2, _mm_castsi128_ps(b_isdenorm));
 2872-	__m128 adjusted4 = _mm_andnot_ps(_mm_castsi128_ps(b_isdenorm),
 2873-	                                 _mm_castsi128_ps(adjusted2));
 2874-	__m128 adjusted5 = _mm_or_ps(adjusted3, adjusted4);
 2875-	__m128i sign = _mm_slli_epi32(justsign, 16);
 2876-	__m128 final = _mm_or_ps(adjusted5, _mm_castsi128_ps(sign));
 2877-	stbir__simdf_store(output + 0, final);
 2878-
 2879-	h = _mm_unpackhi_epi16(i, _mm_setzero_si128());
 2880-	expmant = _mm_and_si128(mnosign, h);
 2881-	justsign = _mm_xor_si128(h, expmant);
 2882-	b_notinfnan = _mm_cmpgt_epi32(infty, expmant);
 2883-	b_isdenorm = _mm_cmpgt_epi32(smallest, expmant);
 2884-	shifted = _mm_slli_epi32(expmant, 13);
 2885-	adj_infnan = _mm_andnot_si128(b_notinfnan, eadjust);
 2886-	adjusted = _mm_add_epi32(eadjust, shifted);
 2887-	den1 = _mm_add_epi32(shifted, STBIR__CONSTI(magic_denorm));
 2888-	adjusted2 = _mm_add_epi32(adjusted, adj_infnan);
 2889-	den2 = _mm_sub_ps(_mm_castsi128_ps(den1), *(const __m128 *)&magic_denorm);
 2890-	adjusted3 = _mm_and_ps(den2, _mm_castsi128_ps(b_isdenorm));
 2891-	adjusted4 = _mm_andnot_ps(_mm_castsi128_ps(b_isdenorm),
 2892-	                          _mm_castsi128_ps(adjusted2));
 2893-	adjusted5 = _mm_or_ps(adjusted3, adjusted4);
 2894-	sign = _mm_slli_epi32(justsign, 16);
 2895-	final = _mm_or_ps(adjusted5, _mm_castsi128_ps(sign));
 2896-	stbir__simdf_store(output + 4, final);
 2897-
 2898-	// ~38 SSE2 ops for 8 values
 2899-}
 2900-
 2901-// Fabian's round-to-nearest-even float to half
 2902-// ~48 SSE2 ops for 8 output
 2903-stbir__inline static void
 2904-stbir__float_to_half_SIMD(void *output, float const *input)
 2905-{
 2906-	static const STBIR__SIMDI_CONST(mask_sign, 0x80000000u);
 2907-	static const STBIR__SIMDI_CONST(
 2908-	    c_f16max, (127 + 16) << 23); // all FP32 values >=this round to +inf
 2909-	static const STBIR__SIMDI_CONST(c_nanbit, 0x200);
 2910-	static const STBIR__SIMDI_CONST(c_infty_as_fp16, 0x7c00);
 2911-	static const STBIR__SIMDI_CONST(
 2912-	    c_min_normal, (127 - 14)
 2913-	                      << 23); // smallest FP32 that yields a normalized FP16
 2914-	static const STBIR__SIMDI_CONST(c_subnorm_magic,
 2915-	                                ((127 - 15) + (23 - 10) + 1) << 23);
 2916-	static const STBIR__SIMDI_CONST(
 2917-	    c_normal_bias,
 2918-	    0xfff -
 2919-	        ((127 - 15) << 23)); // adjust exponent and add mantissa rounding
 2920-
 2921-	__m128 f = _mm_loadu_ps(input);
 2922-	__m128 msign = _mm_castsi128_ps(STBIR__CONSTI(mask_sign));
 2923-	__m128 justsign = _mm_and_ps(msign, f);
 2924-	__m128 absf = _mm_xor_ps(f, justsign);
 2925-	__m128i absf_int = _mm_castps_si128(
 2926-	    absf); // the cast is "free" (extra bypass latency, but no thruput hit)
 2927-	__m128i f16max = STBIR__CONSTI(c_f16max);
 2928-	__m128 b_isnan = _mm_cmpunord_ps(absf, absf); // is this a NaN?
 2929-	__m128i b_isregular =
 2930-	    _mm_cmpgt_epi32(f16max, absf_int); // (sub)normalized or special?
 2931-	__m128i nanbit =
 2932-	    _mm_and_si128(_mm_castps_si128(b_isnan), STBIR__CONSTI(c_nanbit));
 2933-	__m128i inf_or_nan = _mm_or_si128(
 2934-	    nanbit, STBIR__CONSTI(c_infty_as_fp16)); // output for specials
 2935-
 2936-	__m128i min_normal = STBIR__CONSTI(c_min_normal);
 2937-	__m128i b_issub = _mm_cmpgt_epi32(min_normal, absf_int);
 2938-
 2939-	// "result is subnormal" path
 2940-	__m128 subnorm1 = _mm_add_ps(
 2941-	    absf, _mm_castsi128_ps(STBIR__CONSTI(
 2942-	              c_subnorm_magic))); // magic value to round output mantissa
 2943-	__m128i subnorm2 =
 2944-	    _mm_sub_epi32(_mm_castps_si128(subnorm1),
 2945-	                  STBIR__CONSTI(c_subnorm_magic)); // subtract out bias
 2946-
 2947-	// "result is normal" path
 2948-	__m128i mantoddbit = _mm_slli_epi32(
 2949-	    absf_int, 31 - 13); // shift bit 13 (mantissa LSB) to sign
 2950-	__m128i mantodd =
 2951-	    _mm_srai_epi32(mantoddbit, 31); // -1 if FP16 mantissa odd, else 0
 2952-
 2953-	__m128i round1 = _mm_add_epi32(absf_int, STBIR__CONSTI(c_normal_bias));
 2954-	__m128i round2 = _mm_sub_epi32(
 2955-	    round1,
 2956-	    mantodd); // if mantissa LSB odd, bias towards rounding up (RTNE)
 2957-	__m128i normal = _mm_srli_epi32(round2, 13); // rounded result
 2958-
 2959-	// combine the two non-specials
 2960-	__m128i nonspecial = _mm_or_si128(_mm_and_si128(subnorm2, b_issub),
 2961-	                                  _mm_andnot_si128(b_issub, normal));
 2962-
 2963-	// merge in specials as well
 2964-	__m128i joined = _mm_or_si128(_mm_and_si128(nonspecial, b_isregular),
 2965-	                              _mm_andnot_si128(b_isregular, inf_or_nan));
 2966-
 2967-	__m128i sign_shift = _mm_srai_epi32(_mm_castps_si128(justsign), 16);
 2968-	__m128i final2, final = _mm_or_si128(joined, sign_shift);
 2969-
 2970-	f = _mm_loadu_ps(input + 4);
 2971-	justsign = _mm_and_ps(msign, f);
 2972-	absf = _mm_xor_ps(f, justsign);
 2973-	absf_int = _mm_castps_si128(
 2974-	    absf); // the cast is "free" (extra bypass latency, but no thruput hit)
 2975-	b_isnan = _mm_cmpunord_ps(absf, absf); // is this a NaN?
 2976-	b_isregular =
 2977-	    _mm_cmpgt_epi32(f16max, absf_int); // (sub)normalized or special?
 2978-	nanbit = _mm_and_si128(_mm_castps_si128(b_isnan), STBIR__CONSTI(c_nanbit));
 2979-	inf_or_nan = _mm_or_si128(
 2980-	    nanbit, STBIR__CONSTI(c_infty_as_fp16)); // output for specials
 2981-
 2982-	b_issub = _mm_cmpgt_epi32(min_normal, absf_int);
 2983-
 2984-	// "result is subnormal" path
 2985-	subnorm1 = _mm_add_ps(
 2986-	    absf, _mm_castsi128_ps(STBIR__CONSTI(
 2987-	              c_subnorm_magic))); // magic value to round output mantissa
 2988-	subnorm2 =
 2989-	    _mm_sub_epi32(_mm_castps_si128(subnorm1),
 2990-	                  STBIR__CONSTI(c_subnorm_magic)); // subtract out bias
 2991-
 2992-	// "result is normal" path
 2993-	mantoddbit = _mm_slli_epi32(absf_int,
 2994-	                            31 - 13); // shift bit 13 (mantissa LSB) to sign
 2995-	mantodd = _mm_srai_epi32(mantoddbit, 31); // -1 if FP16 mantissa odd, else 0
 2996-
 2997-	round1 = _mm_add_epi32(absf_int, STBIR__CONSTI(c_normal_bias));
 2998-	round2 = _mm_sub_epi32(
 2999-	    round1,
 3000-	    mantodd); // if mantissa LSB odd, bias towards rounding up (RTNE)
 3001-	normal = _mm_srli_epi32(round2, 13); // rounded result
 3002-
 3003-	// combine the two non-specials
 3004-	nonspecial = _mm_or_si128(_mm_and_si128(subnorm2, b_issub),
 3005-	                          _mm_andnot_si128(b_issub, normal));
 3006-
 3007-	// merge in specials as well
 3008-	joined = _mm_or_si128(_mm_and_si128(nonspecial, b_isregular),
 3009-	                      _mm_andnot_si128(b_isregular, inf_or_nan));
 3010-
 3011-	sign_shift = _mm_srai_epi32(_mm_castps_si128(justsign), 16);
 3012-	final2 = _mm_or_si128(joined, sign_shift);
 3013-	final = _mm_packs_epi32(final, final2);
 3014-	stbir__simdi_store(output, final);
 3015-}
 3016-
 3017-#elif defined(STBIR_NEON) && defined(_MSC_VER) && defined(_M_ARM64) &&         \
 3018-    !defined(__clang__) // 64-bit ARM on MSVC (not clang)
 3019-
 3020-static stbir__inline void
 3021-stbir__half_to_float_SIMD(float *output, stbir__FP16 const *input)
 3022-{
 3023-	float16x4_t in0 = vld1_f16(input + 0);
 3024-	float16x4_t in1 = vld1_f16(input + 4);
 3025-	vst1q_f32(output + 0, vcvt_f32_f16(in0));
 3026-	vst1q_f32(output + 4, vcvt_f32_f16(in1));
 3027-}
 3028-
 3029-static stbir__inline void
 3030-stbir__float_to_half_SIMD(stbir__FP16 *output, float const *input)
 3031-{
 3032-	float16x4_t out0 = vcvt_f16_f32(vld1q_f32(input + 0));
 3033-	float16x4_t out1 = vcvt_f16_f32(vld1q_f32(input + 4));
 3034-	vst1_f16(output + 0, out0);
 3035-	vst1_f16(output + 4, out1);
 3036-}
 3037-
 3038-static stbir__inline float
 3039-stbir__half_to_float(stbir__FP16 h)
 3040-{
 3041-	return vgetq_lane_f32(vcvt_f32_f16(vld1_dup_f16(&h)), 0);
 3042-}
 3043-
 3044-static stbir__inline stbir__FP16
 3045-stbir__float_to_half(float f)
 3046-{
 3047-	return vget_lane_f16(vcvt_f16_f32(vdupq_n_f32(f)), 0).n16_u16[0];
 3048-}
 3049-
 3050-#elif defined(STBIR_NEON) && (defined(_M_ARM64) || defined(__aarch64__) ||     \
 3051-                              defined(__arm64__)) // 64-bit ARM
 3052-
 3053-static stbir__inline void
 3054-stbir__half_to_float_SIMD(float *output, stbir__FP16 const *input)
 3055-{
 3056-	float16x8_t in = vld1q_f16(input);
 3057-	vst1q_f32(output + 0, vcvt_f32_f16(vget_low_f16(in)));
 3058-	vst1q_f32(output + 4, vcvt_f32_f16(vget_high_f16(in)));
 3059-}
 3060-
 3061-static stbir__inline void
 3062-stbir__float_to_half_SIMD(stbir__FP16 *output, float const *input)
 3063-{
 3064-	float16x4_t out0 = vcvt_f16_f32(vld1q_f32(input + 0));
 3065-	float16x4_t out1 = vcvt_f16_f32(vld1q_f32(input + 4));
 3066-	vst1q_f16(output, vcombine_f16(out0, out1));
 3067-}
 3068-
 3069-static stbir__inline float
 3070-stbir__half_to_float(stbir__FP16 h)
 3071-{
 3072-	return vgetq_lane_f32(vcvt_f32_f16(vdup_n_f16(h)), 0);
 3073-}
 3074-
 3075-static stbir__inline stbir__FP16
 3076-stbir__float_to_half(float f)
 3077-{
 3078-	return vget_lane_f16(vcvt_f16_f32(vdupq_n_f32(f)), 0);
 3079-}
 3080-
 3081-#elif defined(STBIR_WASM) ||                                                   \
 3082-    (defined(STBIR_NEON) &&                                                    \
 3083-     (defined(_MSC_VER) || defined(_M_ARM) ||                                  \
 3084-      defined(__arm__))) // WASM or 32-bit ARM on MSVC/clang
 3085-
 3086-static stbir__inline void
 3087-stbir__half_to_float_SIMD(float *output, stbir__FP16 const *input)
 3088-{
 3089-	for (int i = 0; i < 8; i++) {
 3090-		output[i] = stbir__half_to_float(input[i]);
 3091-	}
 3092-}
 3093-static stbir__inline void
 3094-stbir__float_to_half_SIMD(stbir__FP16 *output, float const *input)
 3095-{
 3096-	for (int i = 0; i < 8; i++) {
 3097-		output[i] = stbir__float_to_half(input[i]);
 3098-	}
 3099-}
 3100-
 3101-#endif
 3102-
 3103-#ifdef STBIR_SIMD
 3104-
 3105-#define stbir__simdf_0123to3333(out, reg)                                      \
 3106-	(out) = stbir__simdf_swiz(reg, 3, 3, 3, 3)
 3107-#define stbir__simdf_0123to2222(out, reg)                                      \
 3108-	(out) = stbir__simdf_swiz(reg, 2, 2, 2, 2)
 3109-#define stbir__simdf_0123to1111(out, reg)                                      \
 3110-	(out) = stbir__simdf_swiz(reg, 1, 1, 1, 1)
 3111-#define stbir__simdf_0123to0000(out, reg)                                      \
 3112-	(out) = stbir__simdf_swiz(reg, 0, 0, 0, 0)
 3113-#define stbir__simdf_0123to0003(out, reg)                                      \
 3114-	(out) = stbir__simdf_swiz(reg, 0, 0, 0, 3)
 3115-#define stbir__simdf_0123to0001(out, reg)                                      \
 3116-	(out) = stbir__simdf_swiz(reg, 0, 0, 0, 1)
 3117-#define stbir__simdf_0123to1122(out, reg)                                      \
 3118-	(out) = stbir__simdf_swiz(reg, 1, 1, 2, 2)
 3119-#define stbir__simdf_0123to2333(out, reg)                                      \
 3120-	(out) = stbir__simdf_swiz(reg, 2, 3, 3, 3)
 3121-#define stbir__simdf_0123to0023(out, reg)                                      \
 3122-	(out) = stbir__simdf_swiz(reg, 0, 0, 2, 3)
 3123-#define stbir__simdf_0123to1230(out, reg)                                      \
 3124-	(out) = stbir__simdf_swiz(reg, 1, 2, 3, 0)
 3125-#define stbir__simdf_0123to2103(out, reg)                                      \
 3126-	(out) = stbir__simdf_swiz(reg, 2, 1, 0, 3)
 3127-#define stbir__simdf_0123to3210(out, reg)                                      \
 3128-	(out) = stbir__simdf_swiz(reg, 3, 2, 1, 0)
 3129-#define stbir__simdf_0123to2301(out, reg)                                      \
 3130-	(out) = stbir__simdf_swiz(reg, 2, 3, 0, 1)
 3131-#define stbir__simdf_0123to3012(out, reg)                                      \
 3132-	(out) = stbir__simdf_swiz(reg, 3, 0, 1, 2)
 3133-#define stbir__simdf_0123to0011(out, reg)                                      \
 3134-	(out) = stbir__simdf_swiz(reg, 0, 0, 1, 1)
 3135-#define stbir__simdf_0123to1100(out, reg)                                      \
 3136-	(out) = stbir__simdf_swiz(reg, 1, 1, 0, 0)
 3137-#define stbir__simdf_0123to2233(out, reg)                                      \
 3138-	(out) = stbir__simdf_swiz(reg, 2, 2, 3, 3)
 3139-#define stbir__simdf_0123to1133(out, reg)                                      \
 3140-	(out) = stbir__simdf_swiz(reg, 1, 1, 3, 3)
 3141-#define stbir__simdf_0123to0022(out, reg)                                      \
 3142-	(out) = stbir__simdf_swiz(reg, 0, 0, 2, 2)
 3143-#define stbir__simdf_0123to1032(out, reg)                                      \
 3144-	(out) = stbir__simdf_swiz(reg, 1, 0, 3, 2)
 3145-
 3146-typedef union stbir__simdi_u32 {
 3147-	stbir_uint32 m128i_u32[4];
 3148-	int m128i_i32[4];
 3149-	stbir__simdi m128i_i128;
 3150-} stbir__simdi_u32;
 3151-
 3152-static const int STBIR_mask[9] = {0, 0, 0, -1, -1, -1, 0, 0, 0};
 3153-
 3154-static const STBIR__SIMDF_CONST(STBIR_max_uint8_as_float,
 3155-                                stbir__max_uint8_as_float);
 3156-static const STBIR__SIMDF_CONST(STBIR_max_uint16_as_float,
 3157-                                stbir__max_uint16_as_float);
 3158-static const STBIR__SIMDF_CONST(STBIR_max_uint8_as_float_inverted,
 3159-                                stbir__max_uint8_as_float_inverted);
 3160-static const STBIR__SIMDF_CONST(STBIR_max_uint16_as_float_inverted,
 3161-                                stbir__max_uint16_as_float_inverted);
 3162-
 3163-static const STBIR__SIMDF_CONST(STBIR_simd_point5, 0.5f);
 3164-static const STBIR__SIMDF_CONST(STBIR_ones, 1.0f);
 3165-static const STBIR__SIMDI_CONST(STBIR_almost_zero, (127 - 13) << 23);
 3166-static const STBIR__SIMDI_CONST(STBIR_almost_one, 0x3f7fffff);
 3167-static const STBIR__SIMDI_CONST(STBIR_mastissa_mask, 0xff);
 3168-static const STBIR__SIMDI_CONST(STBIR_topscale, 0x02000000);
 3169-
 3170-//   Basically, in simd mode, we unroll the proper amount, and we don't want
 3171-//   the non-simd remnant loops to be unrolled because they only run a few times.
 3172-//   Adding this switch saves about 5K on clang which is Captain Unroll the 3rd.
 3173-#define STBIR_SIMD_STREAMOUT_PTR(star) STBIR_STREAMOUT_PTR(star)
 3174-#define STBIR_SIMD_NO_UNROLL(ptr) STBIR_NO_UNROLL(ptr)
 3175-#define STBIR_SIMD_NO_UNROLL_LOOP_START STBIR_NO_UNROLL_LOOP_START
 3176-#define STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR                                \
 3177-	STBIR_NO_UNROLL_LOOP_START_INF_FOR
 3178-
 3179-#ifdef STBIR_MEMCPY
 3180-#undef STBIR_MEMCPY
 3181-#endif
 3182-#define STBIR_MEMCPY stbir_simd_memcpy
 3183-
 3184-// override normal use of memcpy with much simpler copy (faster and smaller with
 3185-// our sized copies)
 3186-static void
 3187-stbir_simd_memcpy(void *dest, void const *src, size_t bytes)
 3188-{
 3189-	char STBIR_SIMD_STREAMOUT_PTR(*) d = (char *)dest;
 3190-	char STBIR_SIMD_STREAMOUT_PTR(*) d_end = ((char *)dest) + bytes;
 3191-	ptrdiff_t ofs_to_src = (char *)src - (char *)dest;
 3192-
 3193-	// check overlaps
 3194-	STBIR_ASSERT(((d >= ((char *)src) + bytes)) ||
 3195-	             ((d + bytes) <= (char *)src));
 3196-
 3197-	if (bytes < (16 * stbir__simdfX_float_count)) {
 3198-		if (bytes < 16) {
 3199-			if (bytes) {
 3200-				STBIR_SIMD_NO_UNROLL_LOOP_START
 3201-				do {
 3202-					STBIR_SIMD_NO_UNROLL(d);
 3203-					d[0] = d[ofs_to_src];
 3204-					++d;
 3205-				} while (d < d_end);
 3206-			}
 3207-		} else {
 3208-			stbir__simdf x;
 3209-			// do one unaligned to get us aligned for the stream out below
 3210-			stbir__simdf_load(x, (d + ofs_to_src));
 3211-			stbir__simdf_store(d, x);
 3212-			d = (char *)((((size_t)d) + 16) & ~15);
 3213-
 3214-			STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
 3215-			for (;;) {
 3216-				STBIR_SIMD_NO_UNROLL(d);
 3217-
 3218-				if (d > (d_end - 16)) {
 3219-					if (d == d_end) {
 3220-						return;
 3221-					}
 3222-					d = d_end - 16;
 3223-				}
 3224-
 3225-				stbir__simdf_load(x, (d + ofs_to_src));
 3226-				stbir__simdf_store(d, x);
 3227-				d += 16;
 3228-			}
 3229-		}
 3230-	} else {
 3231-		stbir__simdfX x0, x1, x2, x3;
 3232-
 3233-		// do one unaligned to get us aligned for the stream out below
 3234-		stbir__simdfX_load(x0,
 3235-		                   (d + ofs_to_src) + 0 * stbir__simdfX_float_count);
 3236-		stbir__simdfX_load(x1,
 3237-		                   (d + ofs_to_src) + 4 * stbir__simdfX_float_count);
 3238-		stbir__simdfX_load(x2,
 3239-		                   (d + ofs_to_src) + 8 * stbir__simdfX_float_count);
 3240-		stbir__simdfX_load(x3,
 3241-		                   (d + ofs_to_src) + 12 * stbir__simdfX_float_count);
 3242-		stbir__simdfX_store(d + 0 * stbir__simdfX_float_count, x0);
 3243-		stbir__simdfX_store(d + 4 * stbir__simdfX_float_count, x1);
 3244-		stbir__simdfX_store(d + 8 * stbir__simdfX_float_count, x2);
 3245-		stbir__simdfX_store(d + 12 * stbir__simdfX_float_count, x3);
 3246-		d = (char *)((((size_t)d) + (16 * stbir__simdfX_float_count)) &
 3247-		             ~((16 * stbir__simdfX_float_count) - 1));
 3248-
 3249-		STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
 3250-		for (;;) {
 3251-			STBIR_SIMD_NO_UNROLL(d);
 3252-
 3253-			if (d > (d_end - (16 * stbir__simdfX_float_count))) {
 3254-				if (d == d_end) {
 3255-					return;
 3256-				}
 3257-				d = d_end - (16 * stbir__simdfX_float_count);
 3258-			}
 3259-
 3260-			stbir__simdfX_load(x0, (d + ofs_to_src) +
 3261-			                           0 * stbir__simdfX_float_count);
 3262-			stbir__simdfX_load(x1, (d + ofs_to_src) +
 3263-			                           4 * stbir__simdfX_float_count);
 3264-			stbir__simdfX_load(x2, (d + ofs_to_src) +
 3265-			                           8 * stbir__simdfX_float_count);
 3266-			stbir__simdfX_load(x3, (d + ofs_to_src) +
 3267-			                           12 * stbir__simdfX_float_count);
 3268-			stbir__simdfX_store(d + 0 * stbir__simdfX_float_count, x0);
 3269-			stbir__simdfX_store(d + 4 * stbir__simdfX_float_count, x1);
 3270-			stbir__simdfX_store(d + 8 * stbir__simdfX_float_count, x2);
 3271-			stbir__simdfX_store(d + 12 * stbir__simdfX_float_count, x3);
 3272-			d += (16 * stbir__simdfX_float_count);
 3273-		}
 3274-	}
 3275-}
 3276-
 3277-// memcpy that is specifically, intentionally overlapping (src is smaller than
 3278-// dest, so can be
 3279-//   a normal forward copy, bytes is divisible by 4 and bytes is greater than or
 3280-//   equal to the diff between dest and src)
 3281-static void
 3282-stbir_overlapping_memcpy(void *dest, void const *src, size_t bytes)
 3283-{
 3284-	char STBIR_SIMD_STREAMOUT_PTR(*) sd = (char *)src;
 3285-	char STBIR_SIMD_STREAMOUT_PTR(*) s_end = ((char *)src) + bytes;
 3286-	ptrdiff_t ofs_to_dest = (char *)dest - (char *)src;
 3287-
 3288-	if (ofs_to_dest >= 16) // is the overlap at least 16 bytes away?
 3289-	{
 3290-		char STBIR_SIMD_STREAMOUT_PTR(*) s_end16 =
 3291-		    ((char *)src) + (bytes & ~15);
 3292-		STBIR_SIMD_NO_UNROLL_LOOP_START
 3293-		do {
 3294-			stbir__simdf x;
 3295-			STBIR_SIMD_NO_UNROLL(sd);
 3296-			stbir__simdf_load(x, sd);
 3297-			stbir__simdf_store((sd + ofs_to_dest), x);
 3298-			sd += 16;
 3299-		} while (sd < s_end16);
 3300-
 3301-		if (sd == s_end) {
 3302-			return;
 3303-		}
 3304-	}
 3305-
 3306-	do {
 3307-		STBIR_SIMD_NO_UNROLL(sd);
 3308-		*(int *)(sd + ofs_to_dest) = *(int *)sd;
 3309-		sd += 4;
 3310-	} while (sd < s_end);
 3311-}
 3312-
 3313-#else // no STBIR_SIMD
 3314-
 3315-// when in scalar mode, we let unrolling happen, so this macro just does the
 3316-// __restrict
 3317-#define STBIR_SIMD_STREAMOUT_PTR(star) STBIR_STREAMOUT_PTR(star)
 3318-#define STBIR_SIMD_NO_UNROLL(ptr)
 3319-#define STBIR_SIMD_NO_UNROLL_LOOP_START
 3320-#define STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
 3321-
 3322-#endif // STBIR_SIMD
 3323-
 3324-#ifdef STBIR_PROFILE
 3325-
 3326-#ifndef STBIR_PROFILE_FUNC
 3327-
 3328-#if defined(_x86_64) || defined(__x86_64__) || defined(_M_X64) ||              \
 3329-    defined(__x86_64) || defined(__SSE2__) || defined(STBIR_SSE) ||            \
 3330-    defined(_M_IX86_FP) || defined(__i386) || defined(__i386__) ||             \
 3331-    defined(_M_IX86) || defined(_X86_)
 3332-
 3333-#ifdef _MSC_VER
 3334-
 3335-STBIRDEF stbir_uint64
 3336-__rdtsc();
 3337-#define STBIR_PROFILE_FUNC() __rdtsc()
 3338-
 3339-#else // non msvc
 3340-
 3341-static stbir__inline stbir_uint64
 3342-STBIR_PROFILE_FUNC()
 3343-{
 3344-	stbir_uint32 lo, hi;
 3345-	asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
 3346-	return (((stbir_uint64)hi) << 32) | ((stbir_uint64)lo);
 3347-}
 3348-
 3349-#endif // msvc
 3350-
 3351-#elif defined(_M_ARM64) || defined(__aarch64__) || defined(__arm64__) ||       \
 3352-    defined(__ARM_NEON__)
 3353-
 3354-#if defined(_MSC_VER) && !defined(__clang__)
 3355-
 3356-#define STBIR_PROFILE_FUNC() _ReadStatusReg(ARM64_CNTVCT)
 3357-
 3358-#else
 3359-
 3360-static stbir__inline stbir_uint64
 3361-STBIR_PROFILE_FUNC()
 3362-{
 3363-	stbir_uint64 tsc;
 3364-	asm volatile("mrs %0, cntvct_el0" : "=r"(tsc));
 3365-	return tsc;
 3366-}
 3367-
 3368-#endif
 3369-
 3370-#else // x64, arm
 3371-
 3372-#error Unknown platform for profiling.
 3373-
 3374-#endif // x64, arm
 3375-
 3376-#endif // STBIR_PROFILE_FUNC
 3377-
 3378-#define STBIR_ONLY_PROFILE_GET_SPLIT_INFO , stbir__per_split_info *split_info
 3379-#define STBIR_ONLY_PROFILE_SET_SPLIT_INFO , split_info
 3380-
 3381-#define STBIR_ONLY_PROFILE_BUILD_GET_INFO , stbir__info *profile_info
 3382-#define STBIR_ONLY_PROFILE_BUILD_SET_INFO , profile_info
 3383-
 3384-// super light-weight micro profiler
 3385-#define STBIR_PROFILE_START_ll(info, wh)                                       \
 3386-	{                                                                          \
 3387-		stbir_uint64 wh##thiszonetime = STBIR_PROFILE_FUNC();                  \
 3388-		stbir_uint64 *wh##save_parent_excluded_ptr =                           \
 3389-		    info->current_zone_excluded_ptr;                                   \
 3390-		stbir_uint64 wh##current_zone_excluded = 0;                            \
 3391-		info->current_zone_excluded_ptr = &wh##current_zone_excluded;
 3392-#define STBIR_PROFILE_END_ll(info, wh)                                         \
 3393-	wh##thiszonetime = STBIR_PROFILE_FUNC() - wh##thiszonetime;                \
 3394-	info->profile.named.wh += wh##thiszonetime - wh##current_zone_excluded;    \
 3395-	*wh##save_parent_excluded_ptr += wh##thiszonetime;                         \
 3396-	info->current_zone_excluded_ptr = wh##save_parent_excluded_ptr;            \
 3397-	}
 3398-#define STBIR_PROFILE_FIRST_START_ll(info, wh)                                 \
 3399-	{                                                                          \
 3400-		int i;                                                                 \
 3401-		info->current_zone_excluded_ptr = &info->profile.named.total;          \
 3402-		for (i = 0; i < STBIR__ARRAY_SIZE(info->profile.array); i++)           \
 3403-			info->profile.array[i] = 0;                                        \
 3404-	}                                                                          \
 3405-	STBIR_PROFILE_START_ll(info, wh);
 3406-#define STBIR_PROFILE_CLEAR_EXTRAS_ll(info, num)                               \
 3407-	{                                                                          \
 3408-		int extra;                                                             \
 3409-		for (extra = 1; extra < (num); extra++) {                              \
 3410-			int i;                                                             \
 3411-			for (i = 0; i < STBIR__ARRAY_SIZE((info)->profile.array); i++)     \
 3412-				(info)[extra].profile.array[i] = 0;                            \
 3413-		}                                                                      \
 3414-	}
 3415-
 3416-// for thread data
 3417-#define STBIR_PROFILE_START(wh) STBIR_PROFILE_START_ll(split_info, wh)
 3418-#define STBIR_PROFILE_END(wh) STBIR_PROFILE_END_ll(split_info, wh)
 3419-#define STBIR_PROFILE_FIRST_START(wh)                                          \
 3420-	STBIR_PROFILE_FIRST_START_ll(split_info, wh)
 3421-#define STBIR_PROFILE_CLEAR_EXTRAS()                                           \
 3422-	STBIR_PROFILE_CLEAR_EXTRAS_ll(split_info, split_count)
 3423-
 3424-// for build data
 3425-#define STBIR_PROFILE_BUILD_START(wh) STBIR_PROFILE_START_ll(profile_info, wh)
 3426-#define STBIR_PROFILE_BUILD_END(wh) STBIR_PROFILE_END_ll(profile_info, wh)
 3427-#define STBIR_PROFILE_BUILD_FIRST_START(wh)                                    \
 3428-	STBIR_PROFILE_FIRST_START_ll(profile_info, wh)
 3429-#define STBIR_PROFILE_BUILD_CLEAR(info)                                        \
 3430-	{                                                                          \
 3431-		int i;                                                                 \
 3432-		for (i = 0; i < STBIR__ARRAY_SIZE(info->profile.array); i++)           \
 3433-			info->profile.array[i] = 0;                                        \
 3434-	}
 3435-
 3436-#else // no profile
 3437-
 3438-#define STBIR_ONLY_PROFILE_GET_SPLIT_INFO
 3439-#define STBIR_ONLY_PROFILE_SET_SPLIT_INFO
 3440-
 3441-#define STBIR_ONLY_PROFILE_BUILD_GET_INFO
 3442-#define STBIR_ONLY_PROFILE_BUILD_SET_INFO
 3443-
 3444-#define STBIR_PROFILE_START(wh)
 3445-#define STBIR_PROFILE_END(wh)
 3446-#define STBIR_PROFILE_FIRST_START(wh)
 3447-#define STBIR_PROFILE_CLEAR_EXTRAS()
 3448-
 3449-#define STBIR_PROFILE_BUILD_START(wh)
 3450-#define STBIR_PROFILE_BUILD_END(wh)
 3451-#define STBIR_PROFILE_BUILD_FIRST_START(wh)
 3452-#define STBIR_PROFILE_BUILD_CLEAR(info)
 3453-
 3454-#endif // stbir_profile
 3455-
 3456-#ifndef STBIR_CEILF
 3457-#include <math.h>
 3458-#if defined(_MSC_VER) && (_MSC_VER <= 1200) // support VC6 for Sean
 3459-#define STBIR_CEILF(x) ((float)ceil((float)(x)))
 3460-#define STBIR_FLOORF(x) ((float)floor((float)(x)))
 3461-#else
 3462-#define STBIR_CEILF(x) ceilf(x)
 3463-#define STBIR_FLOORF(x) floorf(x)
 3464-#endif
 3465-#endif
 3466-
 3467-#ifndef STBIR_MEMCPY
 3468-// For memcpy
 3469-#include <string.h>
 3470-#define STBIR_MEMCPY(dest, src, len) memcpy(dest, src, len)
 3471-#endif
 3472-
 3473-#ifndef STBIR_SIMD
 3474-
 3475-// memcpy that is specifically, intentionally overlapping (src is smaller than
 3476-// dest, so can be
 3477-//   a normal forward copy, bytes is divisible by 4 and bytes is greater than or
 3478-//   equal to the diff between dest and src)
 3479-static void
 3480-stbir_overlapping_memcpy(void *dest, void const *src, size_t bytes)
 3481-{
 3482-	char STBIR_SIMD_STREAMOUT_PTR(*) sd = (char *)src;
 3483-	char STBIR_SIMD_STREAMOUT_PTR(*) s_end = ((char *)src) + bytes;
 3484-	ptrdiff_t ofs_to_dest = (char *)dest - (char *)src;
 3485-
 3486-	if (ofs_to_dest >= 8) // is the overlap at least 8 bytes away?
 3487-	{
 3488-		char STBIR_SIMD_STREAMOUT_PTR(*) s_end8 = ((char *)src) + (bytes & ~7);
 3489-		STBIR_NO_UNROLL_LOOP_START
 3490-		do {
 3491-			STBIR_NO_UNROLL(sd);
 3492-			*(stbir_uint64 *)(sd + ofs_to_dest) = *(stbir_uint64 *)sd;
 3493-			sd += 8;
 3494-		} while (sd < s_end8);
 3495-
 3496-		if (sd == s_end) {
 3497-			return;
 3498-		}
 3499-	}
 3500-
 3501-	STBIR_NO_UNROLL_LOOP_START
 3502-	do {
 3503-		STBIR_NO_UNROLL(sd);
 3504-		*(int *)(sd + ofs_to_dest) = *(int *)sd;
 3505-		sd += 4;
 3506-	} while (sd < s_end);
 3507-}
 3508-
 3509-#endif
 3510-
 3511-static float
 3512-stbir__filter_trapezoid(float x, float scale, void *user_data)
 3513-{
 3514-	float halfscale = scale / 2;
 3515-	float t = 0.5f + halfscale;
 3516-	STBIR_ASSERT(scale <= 1);
 3517-	STBIR__UNUSED(user_data);
 3518-
 3519-	if (x < 0.0f) {
 3520-		x = -x;
 3521-	}
 3522-
 3523-	if (x >= t) {
 3524-		return 0.0f;
 3525-	} else {
 3526-		float r = 0.5f - halfscale;
 3527-		if (x <= r) {
 3528-			return 1.0f;
 3529-		} else {
 3530-			return (t - x) / scale;
 3531-		}
 3532-	}
 3533-}
 3534-
 3535-static float
 3536-stbir__support_trapezoid(float scale, void *user_data)
 3537-{
 3538-	STBIR__UNUSED(user_data);
 3539-	return 0.5f + scale / 2.0f;
 3540-}
 3541-
 3542-static float
 3543-stbir__filter_triangle(float x, float s, void *user_data)
 3544-{
 3545-	STBIR__UNUSED(s);
 3546-	STBIR__UNUSED(user_data);
 3547-
 3548-	if (x < 0.0f) {
 3549-		x = -x;
 3550-	}
 3551-
 3552-	if (x <= 1.0f) {
 3553-		return 1.0f - x;
 3554-	} else {
 3555-		return 0.0f;
 3556-	}
 3557-}
 3558-
 3559-static float
 3560-stbir__filter_point(float x, float s, void *user_data)
 3561-{
 3562-	STBIR__UNUSED(x);
 3563-	STBIR__UNUSED(s);
 3564-	STBIR__UNUSED(user_data);
 3565-
 3566-	return 1.0f;
 3567-}
 3568-
 3569-static float
 3570-stbir__filter_cubic(float x, float s, void *user_data)
 3571-{
 3572-	STBIR__UNUSED(s);
 3573-	STBIR__UNUSED(user_data);
 3574-
 3575-	if (x < 0.0f) {
 3576-		x = -x;
 3577-	}
 3578-
 3579-	if (x < 1.0f) {
 3580-		return (4.0f + x * x * (3.0f * x - 6.0f)) / 6.0f;
 3581-	} else if (x < 2.0f) {
 3582-		return (8.0f + x * (-12.0f + x * (6.0f - x))) / 6.0f;
 3583-	}
 3584-
 3585-	return (0.0f);
 3586-}
 3587-
 3588-static float
 3589-stbir__filter_catmullrom(float x, float s, void *user_data)
 3590-{
 3591-	STBIR__UNUSED(s);
 3592-	STBIR__UNUSED(user_data);
 3593-
 3594-	if (x < 0.0f) {
 3595-		x = -x;
 3596-	}
 3597-
 3598-	if (x < 1.0f) {
 3599-		return 1.0f - x * x * (2.5f - 1.5f * x);
 3600-	} else if (x < 2.0f) {
 3601-		return 2.0f - x * (4.0f + x * (0.5f * x - 2.5f));
 3602-	}
 3603-
 3604-	return (0.0f);
 3605-}
 3606-
 3607-static float
 3608-stbir__filter_mitchell(float x, float s, void *user_data)
 3609-{
 3610-	STBIR__UNUSED(s);
 3611-	STBIR__UNUSED(user_data);
 3612-
 3613-	if (x < 0.0f) {
 3614-		x = -x;
 3615-	}
 3616-
 3617-	if (x < 1.0f) {
 3618-		return (16.0f + x * x * (21.0f * x - 36.0f)) / 18.0f;
 3619-	} else if (x < 2.0f) {
 3620-		return (32.0f + x * (-60.0f + x * (36.0f - 7.0f * x))) / 18.0f;
 3621-	}
 3622-
 3623-	return (0.0f);
 3624-}
 3625-
 3626-static float
 3627-stbir__support_zeropoint5(float s, void *user_data)
 3628-{
 3629-	STBIR__UNUSED(s);
 3630-	STBIR__UNUSED(user_data);
 3631-	return 0.5f;
 3632-}
 3633-
 3634-static float
 3635-stbir__support_one(float s, void *user_data)
 3636-{
 3637-	STBIR__UNUSED(s);
 3638-	STBIR__UNUSED(user_data);
 3639-	return 1;
 3640-}
 3641-
 3642-static float
 3643-stbir__support_two(float s, void *user_data)
 3644-{
 3645-	STBIR__UNUSED(s);
 3646-	STBIR__UNUSED(user_data);
 3647-	return 2;
 3648-}
 3649-
 3650-// This is the maximum number of input samples that can affect an output sample
 3651-// with the given filter from the output pixel's perspective
 3652-static int
 3653-stbir__get_filter_pixel_width(stbir__support_callback *support, float scale,
 3654-                              void *user_data)
 3655-{
 3656-	STBIR_ASSERT(support != 0);
 3657-
 3658-	if (scale >= (1.0f - stbir__small_float)) { // upscale
 3659-		return (int)STBIR_CEILF(support(1.0f / scale, user_data) * 2.0f);
 3660-	} else {
 3661-		return (int)STBIR_CEILF(support(scale, user_data) * 2.0f / scale);
 3662-	}
 3663-}
 3664-
 3665-// this is how many coefficients per run of the filter (which is different
 3666-//   from the filter_pixel_width depending on if we are scattering or gathering)
 3667-static int
 3668-stbir__get_coefficient_width(stbir__sampler *samp, int is_gather,
 3669-                             void *user_data)
 3670-{
 3671-	float scale = samp->scale_info.scale;
 3672-	stbir__support_callback *support = samp->filter_support;
 3673-
 3674-	switch (is_gather) {
 3675-	case 1:
 3676-		return (int)STBIR_CEILF(support(1.0f / scale, user_data) * 2.0f);
 3677-	case 2:
 3678-		return (int)STBIR_CEILF(support(scale, user_data) * 2.0f / scale);
 3679-	case 0:
 3680-		return (int)STBIR_CEILF(support(scale, user_data) * 2.0f);
 3681-	default:
 3682-		STBIR_ASSERT((is_gather >= 0) && (is_gather <= 2));
 3683-		return 0;
 3684-	}
 3685-}
 3686-
 3687-static int
 3688-stbir__get_contributors(stbir__sampler *samp, int is_gather)
 3689-{
 3690-	if (is_gather) {
 3691-		return samp->scale_info.output_sub_size;
 3692-	} else {
 3693-		return (samp->scale_info.input_full_size +
 3694-		        samp->filter_pixel_margin * 2);
 3695-	}
 3696-}
 3697-
 3698-static int
 3699-stbir__edge_zero_full(int n, int max)
 3700-{
 3701-	STBIR__UNUSED(n);
 3702-	STBIR__UNUSED(max);
 3703-	return 0; // NOTREACHED
 3704-}
 3705-
 3706-static int
 3707-stbir__edge_clamp_full(int n, int max)
 3708-{
 3709-	if (n < 0) {
 3710-		return 0;
 3711-	}
 3712-
 3713-	if (n >= max) {
 3714-		return max - 1;
 3715-	}
 3716-
 3717-	return n; // NOTREACHED
 3718-}
 3719-
 3720-static int
 3721-stbir__edge_reflect_full(int n, int max)
 3722-{
 3723-	if (n < 0) {
 3724-		if (n > -max) {
 3725-			return -n;
 3726-		} else {
 3727-			return max - 1;
 3728-		}
 3729-	}
 3730-
 3731-	if (n >= max) {
 3732-		int max2 = max * 2;
 3733-		if (n >= max2) {
 3734-			return 0;
 3735-		} else {
 3736-			return max2 - n - 1;
 3737-		}
 3738-	}
 3739-
 3740-	return n; // NOTREACHED
 3741-}
 3742-
 3743-static int
 3744-stbir__edge_wrap_full(int n, int max)
 3745-{
 3746-	if (n >= 0) {
 3747-		return (n % max);
 3748-	} else {
 3749-		int m = (-n) % max;
 3750-
 3751-		if (m != 0) {
 3752-			m = max - m;
 3753-		}
 3754-
 3755-		return (m);
 3756-	}
 3757-}
 3758-
 3759-typedef int
 3760-stbir__edge_wrap_func(int n, int max);
 3761-static stbir__edge_wrap_func *stbir__edge_wrap_slow[] = {
 3762-    stbir__edge_clamp_full,   // STBIR_EDGE_CLAMP
 3763-    stbir__edge_reflect_full, // STBIR_EDGE_REFLECT
 3764-    stbir__edge_wrap_full,    // STBIR_EDGE_WRAP
 3765-    stbir__edge_zero_full,    // STBIR_EDGE_ZERO
 3766-};
 3767-
 3768-stbir__inline static int
 3769-stbir__edge_wrap(stbir_edge edge, int n, int max)
 3770-{
 3771-	// avoid per-pixel switch
 3772-	if (n >= 0 && n < max) {
 3773-		return n;
 3774-	}
 3775-	return stbir__edge_wrap_slow[edge](n, max);
 3776-}
 3777-
 3778-#define STBIR__MERGE_RUNS_PIXEL_THRESHOLD 16
 3779-
 3780-// get information on the extents of a sampler
 3781-static void
 3782-stbir__get_extents(stbir__sampler *samp, stbir__extents *scanline_extents)
 3783-{
 3784-	int j, stop;
 3785-	int left_margin, right_margin;
 3786-	int min_n = 0x7fffffff, max_n = -0x7fffffff;
 3787-	int min_left = 0x7fffffff, max_left = -0x7fffffff;
 3788-	int min_right = 0x7fffffff, max_right = -0x7fffffff;
 3789-	stbir_edge edge = samp->edge;
 3790-	stbir__contributors *contributors = samp->contributors;
 3791-	int output_sub_size = samp->scale_info.output_sub_size;
 3792-	int input_full_size = samp->scale_info.input_full_size;
 3793-	int filter_pixel_margin = samp->filter_pixel_margin;
 3794-
 3795-	STBIR_ASSERT(samp->is_gather);
 3796-
 3797-	stop = output_sub_size;
 3798-	for (j = 0; j < stop; j++) {
 3799-		STBIR_ASSERT(contributors[j].n1 >= contributors[j].n0);
 3800-		if (contributors[j].n0 < min_n) {
 3801-			min_n = contributors[j].n0;
 3802-			stop = j + filter_pixel_margin; // if we find a new min, only scan
 3803-			                                // another filter width
 3804-			if (stop > output_sub_size) {
 3805-				stop = output_sub_size;
 3806-			}
 3807-		}
 3808-	}
 3809-
 3810-	stop = 0;
 3811-	for (j = output_sub_size - 1; j >= stop; j--) {
 3812-		STBIR_ASSERT(contributors[j].n1 >= contributors[j].n0);
 3813-		if (contributors[j].n1 > max_n) {
 3814-			max_n = contributors[j].n1;
 3815-			stop = j - filter_pixel_margin; // if we find a new max, only scan
 3816-			                                // another filter width
 3817-			if (stop < 0) {
 3818-				stop = 0;
 3819-			}
 3820-		}
 3821-	}
 3822-
 3823-	STBIR_ASSERT(scanline_extents->conservative.n0 <= min_n);
 3824-	STBIR_ASSERT(scanline_extents->conservative.n1 >= max_n);
 3825-
 3826-	// now calculate how much into the margins we really read
 3827-	left_margin = 0;
 3828-	if (min_n < 0) {
 3829-		left_margin = -min_n;
 3830-		min_n = 0;
 3831-	}
 3832-
 3833-	right_margin = 0;
 3834-	if (max_n >= input_full_size) {
 3835-		right_margin = max_n - input_full_size + 1;
 3836-		max_n = input_full_size - 1;
 3837-	}
 3838-
 3839-	// index 1 is margin pixel extents (how many pixels we hang over the edge)
 3840-	scanline_extents->edge_sizes[0] = left_margin;
 3841-	scanline_extents->edge_sizes[1] = right_margin;
 3842-
 3843-	// index 2 is pixels read from the input
 3844-	scanline_extents->spans[0].n0 = min_n;
 3845-	scanline_extents->spans[0].n1 = max_n;
 3846-	scanline_extents->spans[0].pixel_offset_for_input = min_n;
 3847-
 3848-	// default to no other input range
 3849-	scanline_extents->spans[1].n0 = 0;
 3850-	scanline_extents->spans[1].n1 = -1;
 3851-	scanline_extents->spans[1].pixel_offset_for_input = 0;
 3852-
 3853-	// don't have to do edge calc for zero clamp
 3854-	if (edge == STBIR_EDGE_ZERO) {
 3855-		return;
 3856-	}
 3857-
 3858-	// convert margin pixels to the pixels within the input (min and max)
 3859-	for (j = -left_margin; j < 0; j++) {
 3860-		int p = stbir__edge_wrap(edge, j, input_full_size);
 3861-		if (p < min_left) {
 3862-			min_left = p;
 3863-		}
 3864-		if (p > max_left) {
 3865-			max_left = p;
 3866-		}
 3867-	}
 3868-
 3869-	for (j = input_full_size; j < (input_full_size + right_margin); j++) {
 3870-		int p = stbir__edge_wrap(edge, j, input_full_size);
 3871-		if (p < min_right) {
 3872-			min_right = p;
 3873-		}
 3874-		if (p > max_right) {
 3875-			max_right = p;
 3876-		}
 3877-	}
 3878-
 3879-	// merge the left margin pixel region if it connects within
 3880-	// STBIR__MERGE_RUNS_PIXEL_THRESHOLD pixels of the main pixel region
 3881-	if (min_left != 0x7fffffff) {
 3882-		if (((min_left <= min_n) &&
 3883-		     ((max_left + STBIR__MERGE_RUNS_PIXEL_THRESHOLD) >= min_n)) ||
 3884-		    ((min_n <= min_left) &&
 3885-		     ((max_n + STBIR__MERGE_RUNS_PIXEL_THRESHOLD) >= max_left))) {
 3886-			scanline_extents->spans[0].n0 = min_n = stbir__min(min_n, min_left);
 3887-			scanline_extents->spans[0].n1 = max_n = stbir__max(max_n, max_left);
 3888-			scanline_extents->spans[0].pixel_offset_for_input = min_n;
 3889-			left_margin = 0;
 3890-		}
 3891-	}
 3892-
 3893-	// merge the right margin pixel region if it connects within
 3894-	// STBIR__MERGE_RUNS_PIXEL_THRESHOLD pixels of the main pixel region
 3895-	if (min_right != 0x7fffffff) {
 3896-		if (((min_right <= min_n) &&
 3897-		     ((max_right + STBIR__MERGE_RUNS_PIXEL_THRESHOLD) >= min_n)) ||
 3898-		    ((min_n <= min_right) &&
 3899-		     ((max_n + STBIR__MERGE_RUNS_PIXEL_THRESHOLD) >= max_right))) {
 3900-			scanline_extents->spans[0].n0 = min_n =
 3901-			    stbir__min(min_n, min_right);
 3902-			scanline_extents->spans[0].n1 = max_n =
 3903-			    stbir__max(max_n, max_right);
 3904-			scanline_extents->spans[0].pixel_offset_for_input = min_n;
 3905-			right_margin = 0;
 3906-		}
 3907-	}
 3908-
 3909-	STBIR_ASSERT(scanline_extents->conservative.n0 <= min_n);
 3910-	STBIR_ASSERT(scanline_extents->conservative.n1 >= max_n);
 3911-
 3912-	// you get two ranges when you have the WRAP edge mode and you are doing
 3913-	// just a piece of the resize
 3914-	//   so you need to get a second run of pixels from the opposite side of the
 3915-	//   scanline (which you wouldn't need except for WRAP)
 3916-
 3917-	// if we can't merge the min_left range, add it as a second range
 3918-	if ((left_margin) && (min_left != 0x7fffffff)) {
 3919-		stbir__span *newspan = scanline_extents->spans + 1;
 3920-		STBIR_ASSERT(right_margin == 0);
 3921-		if (min_left < scanline_extents->spans[0].n0) {
 3922-			scanline_extents->spans[1].pixel_offset_for_input =
 3923-			    scanline_extents->spans[0].n0;
 3924-			scanline_extents->spans[1].n0 = scanline_extents->spans[0].n0;
 3925-			scanline_extents->spans[1].n1 = scanline_extents->spans[0].n1;
 3926-			--newspan;
 3927-		}
 3928-		newspan->pixel_offset_for_input = min_left;
 3929-		newspan->n0 = -left_margin;
 3930-		newspan->n1 = (max_left - min_left) - left_margin;
 3931-		scanline_extents->edge_sizes[0] =
 3932-		    0; // don't need to copy the left margin, since we are directly
 3933-		       // decoding into the margin
 3934-	}
 3935-	// if we can't merge the min_right range, add it as a second range
 3936-	else if ((right_margin) && (min_right != 0x7fffffff)) {
 3937-		stbir__span *newspan = scanline_extents->spans + 1;
 3938-		if (min_right < scanline_extents->spans[0].n0) {
 3939-			scanline_extents->spans[1].pixel_offset_for_input =
 3940-			    scanline_extents->spans[0].n0;
 3941-			scanline_extents->spans[1].n0 = scanline_extents->spans[0].n0;
 3942-			scanline_extents->spans[1].n1 = scanline_extents->spans[0].n1;
 3943-			--newspan;
 3944-		}
 3945-		newspan->pixel_offset_for_input = min_right;
 3946-		newspan->n0 = scanline_extents->spans[1].n1 + 1;
 3947-		newspan->n1 =
 3948-		    scanline_extents->spans[1].n1 + 1 + (max_right - min_right);
 3949-		scanline_extents->edge_sizes[1] =
 3950-		    0; // don't need to copy the right margin, since we are directly
 3951-		       // decoding into the margin
 3952-	}
 3953-
 3954-	// sort the spans into write output order
 3955-	if ((scanline_extents->spans[1].n1 > scanline_extents->spans[1].n0) &&
 3956-	    (scanline_extents->spans[0].n0 > scanline_extents->spans[1].n0)) {
 3957-		stbir__span tspan = scanline_extents->spans[0];
 3958-		scanline_extents->spans[0] = scanline_extents->spans[1];
 3959-		scanline_extents->spans[1] = tspan;
 3960-	}
 3961-}
 3962-
 3963-static void
 3964-stbir__calculate_in_pixel_range(int *first_pixel, int *last_pixel,
 3965-                                float out_pixel_center, float out_filter_radius,
 3966-                                float inv_scale, float out_shift,
 3967-                                int input_size, stbir_edge edge)
 3968-{
 3969-	int first, last;
 3970-	float out_pixel_influence_lowerbound = out_pixel_center - out_filter_radius;
 3971-	float out_pixel_influence_upperbound = out_pixel_center + out_filter_radius;
 3972-
 3973-	float in_pixel_influence_lowerbound =
 3974-	    (out_pixel_influence_lowerbound + out_shift) * inv_scale;
 3975-	float in_pixel_influence_upperbound =
 3976-	    (out_pixel_influence_upperbound + out_shift) * inv_scale;
 3977-
 3978-	first = (int)(STBIR_FLOORF(in_pixel_influence_lowerbound + 0.5f));
 3979-	last = (int)(STBIR_FLOORF(in_pixel_influence_upperbound - 0.5f));
 3980-	if (last < first) {
 3981-		last = first; // point sample mode can span a value *right* at 0.5, and
 3982-		              // cause these to cross
 3983-	}
 3984-
 3985-	if (edge == STBIR_EDGE_WRAP) {
 3986-		if (first < -input_size) {
 3987-			first = -input_size;
 3988-		}
 3989-		if (last >= (input_size * 2)) {
 3990-			last = (input_size * 2) - 1;
 3991-		}
 3992-	}
 3993-
 3994-	*first_pixel = first;
 3995-	*last_pixel = last;
 3996-}
 3997-
 3998-static void
 3999-stbir__calculate_coefficients_for_gather_upsample(
 4000-    float out_filter_radius, stbir__kernel_callback *kernel,
 4001-    stbir__scale_info *scale_info, int num_contributors,
 4002-    stbir__contributors *contributors, float *coefficient_group,
 4003-    int coefficient_width, stbir_edge edge, void *user_data)
 4004-{
 4005-	int n, end;
 4006-	float inv_scale = scale_info->inv_scale;
 4007-	float out_shift = scale_info->pixel_shift;
 4008-	int input_size = scale_info->input_full_size;
 4009-	int numerator = scale_info->scale_numerator;
 4010-	int polyphase =
 4011-	    ((scale_info->scale_is_rational) && (numerator < num_contributors));
 4012-
 4013-	// Looping through out pixels
 4014-	end = num_contributors;
 4015-	if (polyphase) {
 4016-		end = numerator;
 4017-	}
 4018-	for (n = 0; n < end; n++) {
 4019-		int i;
 4020-		int last_non_zero;
 4021-		float out_pixel_center = (float)n + 0.5f;
 4022-		float in_center_of_out = (out_pixel_center + out_shift) * inv_scale;
 4023-
 4024-		int in_first_pixel, in_last_pixel;
 4025-
 4026-		stbir__calculate_in_pixel_range(&in_first_pixel, &in_last_pixel,
 4027-		                                out_pixel_center, out_filter_radius,
 4028-		                                inv_scale, out_shift, input_size, edge);
 4029-
 4030-		// make sure we never generate a range larger than our precalculated
 4031-		// coeff width
 4032-		//   this only happens in point sample mode, but it's a good safe thing
 4033-		//   to do anyway
 4034-		if ((in_last_pixel - in_first_pixel + 1) > coefficient_width) {
 4035-			in_last_pixel = in_first_pixel + coefficient_width - 1;
 4036-		}
 4037-
 4038-		last_non_zero = -1;
 4039-		for (i = 0; i <= in_last_pixel - in_first_pixel; i++) {
 4040-			float in_pixel_center = (float)(i + in_first_pixel) + 0.5f;
 4041-			float coeff = kernel(in_center_of_out - in_pixel_center, inv_scale,
 4042-			                     user_data);
 4043-
 4044-			// kill denormals
 4045-			if (((coeff < stbir__small_float) &&
 4046-			     (coeff > -stbir__small_float))) {
 4047-				if (i == 0) // if we're at the front, just eat zero contributors
 4048-				{
 4049-					STBIR_ASSERT((in_last_pixel - in_first_pixel) !=
 4050-					             0); // there should be at least one contrib
 4051-					++in_first_pixel;
 4052-					i--;
 4053-					continue;
 4054-				}
 4055-				coeff =
 4056-				    0; // make sure is fully zero (should keep denormals away)
 4057-			} else {
 4058-				last_non_zero = i;
 4059-			}
 4060-
 4061-			coefficient_group[i] = coeff;
 4062-		}
 4063-
 4064-		in_last_pixel = last_non_zero + in_first_pixel; // kills trailing zeros
 4065-		contributors->n0 = in_first_pixel;
 4066-		contributors->n1 = in_last_pixel;
 4067-
 4068-		STBIR_ASSERT(contributors->n1 >= contributors->n0);
 4069-
 4070-		++contributors;
 4071-		coefficient_group += coefficient_width;
 4072-	}
 4073-}
 4074-
 4075-static void
 4076-stbir__insert_coeff(stbir__contributors *contribs, float *coeffs, int new_pixel,
 4077-                    float new_coeff, int max_width)
 4078-{
 4079-	if (new_pixel <= contribs->n1) // before the end
 4080-	{
 4081-		if (new_pixel < contribs->n0) // before the front?
 4082-		{
 4083-			if ((contribs->n1 - new_pixel + 1) <= max_width) {
 4084-				int j, o = contribs->n0 - new_pixel;
 4085-				for (j = contribs->n1 - contribs->n0; j <= 0; j--) {
 4086-					coeffs[j + o] = coeffs[j];
 4087-				}
 4088-				for (j = 1; j < o; j--) {
 4089-					coeffs[j] = coeffs[0];
 4090-				}
 4091-				coeffs[0] = new_coeff;
 4092-				contribs->n0 = new_pixel;
 4093-			}
 4094-		} else {
 4095-			coeffs[new_pixel - contribs->n0] += new_coeff;
 4096-		}
 4097-	} else {
 4098-		if ((new_pixel - contribs->n0 + 1) <= max_width) {
 4099-			int j, e = new_pixel - contribs->n0;
 4100-			for (j = (contribs->n1 - contribs->n0) + 1; j < e;
 4101-			     j++) { // clear in-betweens coeffs if there are any
 4102-				coeffs[j] = 0;
 4103-			}
 4104-
 4105-			coeffs[e] = new_coeff;
 4106-			contribs->n1 = new_pixel;
 4107-		}
 4108-	}
 4109-}
 4110-
 4111-static void
 4112-stbir__calculate_out_pixel_range(int *first_pixel, int *last_pixel,
 4113-                                 float in_pixel_center, float in_pixels_radius,
 4114-                                 float scale, float out_shift, int out_size)
 4115-{
 4116-	float in_pixel_influence_lowerbound = in_pixel_center - in_pixels_radius;
 4117-	float in_pixel_influence_upperbound = in_pixel_center + in_pixels_radius;
 4118-	float out_pixel_influence_lowerbound =
 4119-	    in_pixel_influence_lowerbound * scale - out_shift;
 4120-	float out_pixel_influence_upperbound =
 4121-	    in_pixel_influence_upperbound * scale - out_shift;
 4122-	int out_first_pixel =
 4123-	    (int)(STBIR_FLOORF(out_pixel_influence_lowerbound + 0.5f));
 4124-	int out_last_pixel =
 4125-	    (int)(STBIR_FLOORF(out_pixel_influence_upperbound - 0.5f));
 4126-
 4127-	if (out_first_pixel < 0) {
 4128-		out_first_pixel = 0;
 4129-	}
 4130-	if (out_last_pixel >= out_size) {
 4131-		out_last_pixel = out_size - 1;
 4132-	}
 4133-	*first_pixel = out_first_pixel;
 4134-	*last_pixel = out_last_pixel;
 4135-}
 4136-
 4137-static void
 4138-stbir__calculate_coefficients_for_gather_downsample(
 4139-    int start, int end, float in_pixels_radius, stbir__kernel_callback *kernel,
 4140-    stbir__scale_info *scale_info, int coefficient_width, int num_contributors,
 4141-    stbir__contributors *contributors, float *coefficient_group,
 4142-    void *user_data)
 4143-{
 4144-	int in_pixel;
 4145-	int i;
 4146-	int first_out_inited = -1;
 4147-	float scale = scale_info->scale;
 4148-	float out_shift = scale_info->pixel_shift;
 4149-	int out_size = scale_info->output_sub_size;
 4150-	int numerator = scale_info->scale_numerator;
 4151-	int polyphase = ((scale_info->scale_is_rational) && (numerator < out_size));
 4152-
 4153-	STBIR__UNUSED(num_contributors);
 4154-
 4155-	// Loop through the input pixels
 4156-	for (in_pixel = start; in_pixel < end; in_pixel++) {
 4157-		float in_pixel_center = (float)in_pixel + 0.5f;
 4158-		float out_center_of_in = in_pixel_center * scale - out_shift;
 4159-		int out_first_pixel, out_last_pixel;
 4160-
 4161-		stbir__calculate_out_pixel_range(&out_first_pixel, &out_last_pixel,
 4162-		                                 in_pixel_center, in_pixels_radius,
 4163-		                                 scale, out_shift, out_size);
 4164-
 4165-		if (out_first_pixel > out_last_pixel) {
 4166-			continue;
 4167-		}
 4168-
 4169-		// clamp or exit if we are using polyphase filtering, and the limit is
 4170-		// up
 4171-		if (polyphase) {
 4172-			// when polyphase, you only have to do coeffs up to the numerator
 4173-			// count
 4174-			if (out_first_pixel == numerator) {
 4175-				break;
 4176-			}
 4177-
 4178-			// don't do any extra work, clamp last pixel at numerator too
 4179-			if (out_last_pixel >= numerator) {
 4180-				out_last_pixel = numerator - 1;
 4181-			}
 4182-		}
 4183-
 4184-		for (i = 0; i <= out_last_pixel - out_first_pixel; i++) {
 4185-			float out_pixel_center = (float)(i + out_first_pixel) + 0.5f;
 4186-			float x = out_pixel_center - out_center_of_in;
 4187-			float coeff = kernel(x, scale, user_data) * scale;
 4188-
 4189-			// kill the coeff if it's too small (avoid denormals)
 4190-			if (((coeff < stbir__small_float) &&
 4191-			     (coeff > -stbir__small_float))) {
 4192-				coeff = 0.0f;
 4193-			}
 4194-
 4195-			{
 4196-				int out = i + out_first_pixel;
 4197-				float *coeffs = coefficient_group + out * coefficient_width;
 4198-				stbir__contributors *contribs = contributors + out;
 4199-
 4200-				// is this the first time this output pixel has been seen?  Init
 4201-				// it.
 4202-				if (out > first_out_inited) {
 4203-					STBIR_ASSERT(
 4204-					    out == (first_out_inited +
 4205-					            1)); // ensure we have only advanced one at time
 4206-					first_out_inited = out;
 4207-					contribs->n0 = in_pixel;
 4208-					contribs->n1 = in_pixel;
 4209-					coeffs[0] = coeff;
 4210-				} else {
 4211-					// insert on end (always in order)
 4212-					if (coeffs[0] == 0.0f) // if the first coefficient is zero,
 4213-					                       // then zap it for this coeffs
 4214-					{
 4215-						STBIR_ASSERT(
 4216-						    (in_pixel - contribs->n0) ==
 4217-						    1); // ensure that when we zap, we're at the 2nd pos
 4218-						contribs->n0 = in_pixel;
 4219-					}
 4220-					contribs->n1 = in_pixel;
 4221-					STBIR_ASSERT((in_pixel - contribs->n0) < coefficient_width);
 4222-					coeffs[in_pixel - contribs->n0] = coeff;
 4223-				}
 4224-			}
 4225-		}
 4226-	}
 4227-}
 4228-
 4229-#ifdef STBIR_RENORMALIZE_IN_FLOAT
 4230-#define STBIR_RENORM_TYPE float
 4231-#else
 4232-#define STBIR_RENORM_TYPE double
 4233-#endif
 4234-
 4235-static void
 4236-stbir__cleanup_gathered_coefficients(stbir_edge edge,
 4237-                                     stbir__filter_extent_info *filter_info,
 4238-                                     stbir__scale_info *scale_info,
 4239-                                     int num_contributors,
 4240-                                     stbir__contributors *contributors,
 4241-                                     float *coefficient_group,
 4242-                                     int coefficient_width)
 4243-{
 4244-	int input_size = scale_info->input_full_size;
 4245-	int input_last_n1 = input_size - 1;
 4246-	int n, end;
 4247-	int lowest = 0x7fffffff;
 4248-	int highest = -0x7fffffff;
 4249-	int widest = -1;
 4250-	int numerator = scale_info->scale_numerator;
 4251-	int denominator = scale_info->scale_denominator;
 4252-	int polyphase =
 4253-	    ((scale_info->scale_is_rational) && (numerator < num_contributors));
 4254-	float *coeffs;
 4255-	stbir__contributors *contribs;
 4256-
 4257-	// weight all the coeffs for each sample
 4258-	coeffs = coefficient_group;
 4259-	contribs = contributors;
 4260-	end = num_contributors;
 4261-	if (polyphase) {
 4262-		end = numerator;
 4263-	}
 4264-	for (n = 0; n < end; n++) {
 4265-		int i;
 4266-		STBIR_RENORM_TYPE filter_scale, total_filter = 0;
 4267-		int e;
 4268-
 4269-		// add all contribs
 4270-		e = contribs->n1 - contribs->n0;
 4271-		for (i = 0; i <= e; i++) {
 4272-			total_filter += (STBIR_RENORM_TYPE)coeffs[i];
 4273-			STBIR_ASSERT((coeffs[i] >= -2.0f) &&
 4274-			             (coeffs[i] <= 2.0f)); // check for wonky weights
 4275-		}
 4276-
 4277-		// rescale
 4278-		if ((total_filter < stbir__small_float) &&
 4279-		    (total_filter > -stbir__small_float)) {
 4280-			// all coeffs are extremely small, just zero it
 4281-			contribs->n1 = contribs->n0;
 4282-			coeffs[0] = 0.0f;
 4283-		} else {
 4284-			// if the total isn't 1.0, rescale everything
 4285-			if ((total_filter < (1.0f - stbir__small_float)) ||
 4286-			    (total_filter > (1.0f + stbir__small_float))) {
 4287-				filter_scale = ((STBIR_RENORM_TYPE)1.0) / total_filter;
 4288-
 4289-				// scale them all
 4290-				for (i = 0; i <= e; i++) {
 4291-					coeffs[i] = (float)(coeffs[i] * filter_scale);
 4292-				}
 4293-			}
 4294-		}
 4295-		++contribs;
 4296-		coeffs += coefficient_width;
 4297-	}
 4298-
 4299-	// if we have a rational for the scale, we can exploit the polyphaseness to
 4300-	// not calculate
 4301-	//   most of the coefficients, so we copy them here
 4302-	if (polyphase) {
 4303-		stbir__contributors *prev_contribs = contributors;
 4304-		stbir__contributors *cur_contribs = contributors + numerator;
 4305-
 4306-		for (n = numerator; n < num_contributors; n++) {
 4307-			cur_contribs->n0 = prev_contribs->n0 + denominator;
 4308-			cur_contribs->n1 = prev_contribs->n1 + denominator;
 4309-			++cur_contribs;
 4310-			++prev_contribs;
 4311-		}
 4312-		stbir_overlapping_memcpy(coefficient_group +
 4313-		                             numerator * coefficient_width,
 4314-		                         coefficient_group,
 4315-		                         (num_contributors - numerator) *
 4316-		                             coefficient_width * sizeof(coeffs[0]));
 4317-	}
 4318-
 4319-	coeffs = coefficient_group;
 4320-	contribs = contributors;
 4321-
 4322-	for (n = 0; n < num_contributors; n++) {
 4323-		int i;
 4324-
 4325-		// in zero edge mode, just remove out of bounds contribs completely
 4326-		// (since their weights are accounted for now)
 4327-		if (edge == STBIR_EDGE_ZERO) {
 4328-			// shrink the right side if necessary
 4329-			if (contribs->n1 > input_last_n1) {
 4330-				contribs->n1 = input_last_n1;
 4331-			}
 4332-
 4333-			// shrink the left side
 4334-			if (contribs->n0 < 0) {
 4335-				int j, left, skips = 0;
 4336-
 4337-				skips = -contribs->n0;
 4338-				contribs->n0 = 0;
 4339-
 4340-				// now move down the weights
 4341-				left = contribs->n1 - contribs->n0 + 1;
 4342-				if (left > 0) {
 4343-					for (j = 0; j < left; j++) {
 4344-						coeffs[j] = coeffs[j + skips];
 4345-					}
 4346-				}
 4347-			}
 4348-		} else if ((edge == STBIR_EDGE_CLAMP) || (edge == STBIR_EDGE_REFLECT)) {
 4349-			// for clamp and reflect, calculate the true inbounds position
 4350-			// (based on edge type) and just add that to the existing weight
 4351-
 4352-			// right hand side first
 4353-			if (contribs->n1 > input_last_n1) {
 4354-				int start = contribs->n0;
 4355-				int endi = contribs->n1;
 4356-				contribs->n1 = input_last_n1;
 4357-				for (i = input_size; i <= endi; i++) {
 4358-					stbir__insert_coeff(
 4359-					    contribs, coeffs,
 4360-					    stbir__edge_wrap_slow[edge](i, input_size),
 4361-					    coeffs[i - start], coefficient_width);
 4362-				}
 4363-			}
 4364-
 4365-			// now check left hand edge
 4366-			if (contribs->n0 < 0) {
 4367-				int save_n0;
 4368-				float save_n0_coeff;
 4369-				float *c = coeffs - (contribs->n0 + 1);
 4370-
 4371-				// reinsert the coeffs with it reflected or clamped (insert
 4372-				// accumulates, if the coeffs exist)
 4373-				for (i = -1; i > contribs->n0; i--) {
 4374-					stbir__insert_coeff(
 4375-					    contribs, coeffs,
 4376-					    stbir__edge_wrap_slow[edge](i, input_size), *c--,
 4377-					    coefficient_width);
 4378-				}
 4379-				save_n0 = contribs->n0;
 4380-				save_n0_coeff = c[0]; // save it, since we didn't do the final
 4381-				                      // one (i==n0), because there might be too
 4382-				                      // many coeffs to hold (before we resize)!
 4383-
 4384-				// now slide all the coeffs down (since we have accumulated them
 4385-				// in the positive contribs) and reset the first contrib
 4386-				contribs->n0 = 0;
 4387-				for (i = 0; i <= contribs->n1; i++) {
 4388-					coeffs[i] = coeffs[i - save_n0];
 4389-				}
 4390-
 4391-				// now that we have shrunk down the contribs, we insert the
 4392-				// first one safely
 4393-				stbir__insert_coeff(
 4394-				    contribs, coeffs,
 4395-				    stbir__edge_wrap_slow[edge](save_n0, input_size),
 4396-				    save_n0_coeff, coefficient_width);
 4397-			}
 4398-		}
 4399-
 4400-		if (contribs->n0 <= contribs->n1) {
 4401-			int diff = contribs->n1 - contribs->n0 + 1;
 4402-			while (diff && (coeffs[diff - 1] == 0.0f)) {
 4403-				--diff;
 4404-			}
 4405-
 4406-			contribs->n1 = contribs->n0 + diff - 1;
 4407-
 4408-			if (contribs->n0 <= contribs->n1) {
 4409-				if (contribs->n0 < lowest) {
 4410-					lowest = contribs->n0;
 4411-				}
 4412-				if (contribs->n1 > highest) {
 4413-					highest = contribs->n1;
 4414-				}
 4415-				if (diff > widest) {
 4416-					widest = diff;
 4417-				}
 4418-			}
 4419-
 4420-			// re-zero out unused coefficients (if any)
 4421-			for (i = diff; i < coefficient_width; i++) {
 4422-				coeffs[i] = 0.0f;
 4423-			}
 4424-		}
 4425-
 4426-		++contribs;
 4427-		coeffs += coefficient_width;
 4428-	}
 4429-	filter_info->lowest = lowest;
 4430-	filter_info->highest = highest;
 4431-	filter_info->widest = widest;
 4432-}
 4433-
 4434-#undef STBIR_RENORM_TYPE
 4435-
 4436-static int
 4437-stbir__pack_coefficients(int num_contributors,
 4438-                         stbir__contributors *contributors, float *coefficents,
 4439-                         int coefficient_width, int widest, int row0, int row1)
 4440-{
 4441-#define STBIR_MOVE_1(dest, src)                                                \
 4442-	{                                                                          \
 4443-		STBIR_NO_UNROLL(dest);                                                 \
 4444-		((stbir_uint32 *)(dest))[0] = ((stbir_uint32 *)(src))[0];              \
 4445-	}
 4446-#define STBIR_MOVE_2(dest, src)                                                \
 4447-	{                                                                          \
 4448-		STBIR_NO_UNROLL(dest);                                                 \
 4449-		((stbir_uint64 *)(dest))[0] = ((stbir_uint64 *)(src))[0];              \
 4450-	}
 4451-#ifdef STBIR_SIMD
 4452-#define STBIR_MOVE_4(dest, src)                                                \
 4453-	{                                                                          \
 4454-		stbir__simdf t;                                                        \
 4455-		STBIR_NO_UNROLL(dest);                                                 \
 4456-		stbir__simdf_load(t, src);                                             \
 4457-		stbir__simdf_store(dest, t);                                           \
 4458-	}
 4459-#else
 4460-#define STBIR_MOVE_4(dest, src)                                                \
 4461-	{                                                                          \
 4462-		STBIR_NO_UNROLL(dest);                                                 \
 4463-		((stbir_uint64 *)(dest))[0] = ((stbir_uint64 *)(src))[0];              \
 4464-		((stbir_uint64 *)(dest))[1] = ((stbir_uint64 *)(src))[1];              \
 4465-	}
 4466-#endif
 4467-
 4468-	int row_end = row1 + 1;
 4469-	STBIR__UNUSED(row0); // only used in an assert
 4470-
 4471-	if (coefficient_width != widest) {
 4472-		float *pc = coefficents;
 4473-		float *coeffs = coefficents;
 4474-		float *pc_end = coefficents + num_contributors * widest;
 4475-		switch (widest) {
 4476-		case 1:
 4477-			STBIR_NO_UNROLL_LOOP_START
 4478-			do {
 4479-				STBIR_MOVE_1(pc, coeffs);
 4480-				++pc;
 4481-				coeffs += coefficient_width;
 4482-			} while (pc < pc_end);
 4483-			break;
 4484-		case 2:
 4485-			STBIR_NO_UNROLL_LOOP_START
 4486-			do {
 4487-				STBIR_MOVE_2(pc, coeffs);
 4488-				pc += 2;
 4489-				coeffs += coefficient_width;
 4490-			} while (pc < pc_end);
 4491-			break;
 4492-		case 3:
 4493-			STBIR_NO_UNROLL_LOOP_START
 4494-			do {
 4495-				STBIR_MOVE_2(pc, coeffs);
 4496-				STBIR_MOVE_1(pc + 2, coeffs + 2);
 4497-				pc += 3;
 4498-				coeffs += coefficient_width;
 4499-			} while (pc < pc_end);
 4500-			break;
 4501-		case 4:
 4502-			STBIR_NO_UNROLL_LOOP_START
 4503-			do {
 4504-				STBIR_MOVE_4(pc, coeffs);
 4505-				pc += 4;
 4506-				coeffs += coefficient_width;
 4507-			} while (pc < pc_end);
 4508-			break;
 4509-		case 5:
 4510-			STBIR_NO_UNROLL_LOOP_START
 4511-			do {
 4512-				STBIR_MOVE_4(pc, coeffs);
 4513-				STBIR_MOVE_1(pc + 4, coeffs + 4);
 4514-				pc += 5;
 4515-				coeffs += coefficient_width;
 4516-			} while (pc < pc_end);
 4517-			break;
 4518-		case 6:
 4519-			STBIR_NO_UNROLL_LOOP_START
 4520-			do {
 4521-				STBIR_MOVE_4(pc, coeffs);
 4522-				STBIR_MOVE_2(pc + 4, coeffs + 4);
 4523-				pc += 6;
 4524-				coeffs += coefficient_width;
 4525-			} while (pc < pc_end);
 4526-			break;
 4527-		case 7:
 4528-			STBIR_NO_UNROLL_LOOP_START
 4529-			do {
 4530-				STBIR_MOVE_4(pc, coeffs);
 4531-				STBIR_MOVE_2(pc + 4, coeffs + 4);
 4532-				STBIR_MOVE_1(pc + 6, coeffs + 6);
 4533-				pc += 7;
 4534-				coeffs += coefficient_width;
 4535-			} while (pc < pc_end);
 4536-			break;
 4537-		case 8:
 4538-			STBIR_NO_UNROLL_LOOP_START
 4539-			do {
 4540-				STBIR_MOVE_4(pc, coeffs);
 4541-				STBIR_MOVE_4(pc + 4, coeffs + 4);
 4542-				pc += 8;
 4543-				coeffs += coefficient_width;
 4544-			} while (pc < pc_end);
 4545-			break;
 4546-		case 9:
 4547-			STBIR_NO_UNROLL_LOOP_START
 4548-			do {
 4549-				STBIR_MOVE_4(pc, coeffs);
 4550-				STBIR_MOVE_4(pc + 4, coeffs + 4);
 4551-				STBIR_MOVE_1(pc + 8, coeffs + 8);
 4552-				pc += 9;
 4553-				coeffs += coefficient_width;
 4554-			} while (pc < pc_end);
 4555-			break;
 4556-		case 10:
 4557-			STBIR_NO_UNROLL_LOOP_START
 4558-			do {
 4559-				STBIR_MOVE_4(pc, coeffs);
 4560-				STBIR_MOVE_4(pc + 4, coeffs + 4);
 4561-				STBIR_MOVE_2(pc + 8, coeffs + 8);
 4562-				pc += 10;
 4563-				coeffs += coefficient_width;
 4564-			} while (pc < pc_end);
 4565-			break;
 4566-		case 11:
 4567-			STBIR_NO_UNROLL_LOOP_START
 4568-			do {
 4569-				STBIR_MOVE_4(pc, coeffs);
 4570-				STBIR_MOVE_4(pc + 4, coeffs + 4);
 4571-				STBIR_MOVE_2(pc + 8, coeffs + 8);
 4572-				STBIR_MOVE_1(pc + 10, coeffs + 10);
 4573-				pc += 11;
 4574-				coeffs += coefficient_width;
 4575-			} while (pc < pc_end);
 4576-			break;
 4577-		case 12:
 4578-			STBIR_NO_UNROLL_LOOP_START
 4579-			do {
 4580-				STBIR_MOVE_4(pc, coeffs);
 4581-				STBIR_MOVE_4(pc + 4, coeffs + 4);
 4582-				STBIR_MOVE_4(pc + 8, coeffs + 8);
 4583-				pc += 12;
 4584-				coeffs += coefficient_width;
 4585-			} while (pc < pc_end);
 4586-			break;
 4587-		default:
 4588-			STBIR_NO_UNROLL_LOOP_START
 4589-			do {
 4590-				float *copy_end = pc + widest - 4;
 4591-				float *c = coeffs;
 4592-				do {
 4593-					STBIR_NO_UNROLL(pc);
 4594-					STBIR_MOVE_4(pc, c);
 4595-					pc += 4;
 4596-					c += 4;
 4597-				} while (pc <= copy_end);
 4598-				copy_end += 4;
 4599-				STBIR_NO_UNROLL_LOOP_START
 4600-				while (pc < copy_end) {
 4601-					STBIR_MOVE_1(pc, c);
 4602-					++pc;
 4603-					++c;
 4604-				}
 4605-				coeffs += coefficient_width;
 4606-			} while (pc < pc_end);
 4607-			break;
 4608-		}
 4609-	}
 4610-
 4611-	// some horizontal routines read one float off the end (which is then masked
 4612-	// off), so put in a sentinel so we don't read an snan or denormal
 4613-	coefficents[widest * num_contributors] = 8888.0f;
 4614-
 4615-	// the minimum we might read for unrolled filter widths is 12. So, we need
 4616-	// to
 4617-	//   make sure we never read outside the decode buffer, by possibly moving
 4618-	//   the sample area back into the scanline, and putting zero weights
 4619-	//   first.
 4620-	// we start on the right edge and check until we're well past the possible
 4621-	//   clip area (2*widest).
 4622-	{
 4623-		stbir__contributors *contribs = contributors + num_contributors - 1;
 4624-		float *coeffs = coefficents + widest * (num_contributors - 1);
 4625-
 4626-		// go until no chance of clipping (this is usually less than 8 loops)
 4627-		while ((contribs >= contributors) &&
 4628-		       ((contribs->n0 + widest * 2) >= row_end)) {
 4629-			// might we clip??
 4630-			if ((contribs->n0 + widest) > row_end) {
 4631-				int stop_range = widest;
 4632-
 4633-				// if range is larger than 12, it will be handled by generic
 4634-				// loops that can terminate on the exact length
 4635-				//   of this contrib n1, instead of a fixed widest amount - so
 4636-				//   calculate this
 4637-				if (widest > 12) {
 4638-					int mod;
 4639-
 4640-					// how far will be read in the n_coeff loop (which depends
 4641-					// on the widest count mod4);
 4642-					mod = widest & 3;
 4643-					stop_range =
 4644-					    (((contribs->n1 - contribs->n0 + 1) - mod + 3) & ~3) +
 4645-					    mod;
 4646-
 4647-					// the n_coeff loops do a minimum amount of coeffs, so
 4648-					// factor that in!
 4649-					if (stop_range < (8 + mod)) {
 4650-						stop_range = 8 + mod;
 4651-					}
 4652-				}
 4653-
 4654-				// now see if we still clip with the refined range
 4655-				if ((contribs->n0 + stop_range) > row_end) {
 4656-					int new_n0 = row_end - stop_range;
 4657-					int num = contribs->n1 - contribs->n0 + 1;
 4658-					int backup = contribs->n0 - new_n0;
 4659-					float *from_co = coeffs + num - 1;
 4660-					float *to_co = from_co + backup;
 4661-
 4662-					STBIR_ASSERT((new_n0 >= row0) && (new_n0 < contribs->n0));
 4663-
 4664-					// move the coeffs over
 4665-					while (num) {
 4666-						*to_co-- = *from_co--;
 4667-						--num;
 4668-					}
 4669-					// zero new positions
 4670-					while (to_co >= coeffs) {
 4671-						*to_co-- = 0;
 4672-					}
 4673-					// set new start point
 4674-					contribs->n0 = new_n0;
 4675-					if (widest > 12) {
 4676-						int mod;
 4677-
 4678-						// how far will be read in the n_coeff loop (which
 4679-						// depends on the widest count mod4);
 4680-						mod = widest & 3;
 4681-						stop_range =
 4682-						    (((contribs->n1 - contribs->n0 + 1) - mod + 3) &
 4683-						     ~3) +
 4684-						    mod;
 4685-
 4686-						// the n_coeff loops do a minimum amount of coeffs, so
 4687-						// factor that in!
 4688-						if (stop_range < (8 + mod)) {
 4689-							stop_range = 8 + mod;
 4690-						}
 4691-					}
 4692-				}
 4693-			}
 4694-			--contribs;
 4695-			coeffs -= widest;
 4696-		}
 4697-	}
 4698-
 4699-	return widest;
 4700-#undef STBIR_MOVE_1
 4701-#undef STBIR_MOVE_2
 4702-#undef STBIR_MOVE_4
 4703-}
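The clipping pass above slides a contributor's window back toward the start of the scanline so that a fixed-width read never runs past `row_end`, and pads the front with zero weights so the weighted sum stays the same. A minimal sketch of that idea for a single contributor is shown here. `shift_window_back` and its parameters are hypothetical names used only for illustration, not part of stb_image_resize2, and the sketch assumes `coeffs` has room for `read_width` floats.

```c
#include <assert.h>

/* Hypothetical helper (not from stb_image_resize2).
   coeffs holds the weights for input pixels n0..n1 (inclusive).
   A fixed-width kernel will read read_width pixels starting at n0;
   valid pixels are those with index < row_end.
   Returns the (possibly moved) start index. */
static int shift_window_back(float *coeffs, int n0, int n1,
                             int read_width, int row_end)
{
    int num = n1 - n0 + 1;
    int new_n0, backup, i;

    assert(num >= 1 && num <= read_width);
    assert(n1 < row_end); /* the real weights must already be in range */

    /* the fixed-width read already fits: nothing to do */
    if (n0 + read_width <= row_end) {
        return n0;
    }

    new_n0 = row_end - read_width;
    backup = n0 - new_n0;

    /* move the weights up by 'backup' slots, last one first so that
       overlapping source and destination slots are handled correctly */
    for (i = num - 1; i >= 0; i--) {
        coeffs[i + backup] = coeffs[i];
    }
    /* the newly exposed leading slots contribute nothing */
    for (i = 0; i < backup; i++) {
        coeffs[i] = 0.0f;
    }
    return new_n0;
}
```

With weights `{0.25, 0.5, 0.25}` on pixels 8..10, a read width of 4 and `row_end` of 11, the window moves to start at pixel 7 and the weights become `{0, 0.25, 0.5, 0.25}`, so the same three pixels receive the same weights.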
 4704-
 4705-static void
 4706-stbir__calculate_filters(stbir__sampler *samp,
 4707-                         stbir__sampler *other_axis_for_pivot,
 4708-                         void *user_data STBIR_ONLY_PROFILE_BUILD_GET_INFO)
 4709-{
 4710-	int n;
 4711-	float scale = samp->scale_info.scale;
 4712-	stbir__kernel_callback *kernel = samp->filter_kernel;
 4713-	stbir__support_callback *support = samp->filter_support;
 4714-	float inv_scale = samp->scale_info.inv_scale;
 4715-	int input_full_size = samp->scale_info.input_full_size;
 4716-	int gather_num_contributors = samp->num_contributors;
 4717-	stbir__contributors *gather_contributors = samp->contributors;
 4718-	float *gather_coeffs = samp->coefficients;
 4719-	int gather_coefficient_width = samp->coefficient_width;
 4720-
 4721-	switch (samp->is_gather) {
 4722-	case 1: // gather upsample
 4723-	{
 4724-		float out_pixels_radius = support(inv_scale, user_data) * scale;
 4725-
 4726-		stbir__calculate_coefficients_for_gather_upsample(
 4727-		    out_pixels_radius, kernel, &samp->scale_info,
 4728-		    gather_num_contributors, gather_contributors, gather_coeffs,
 4729-		    gather_coefficient_width, samp->edge, user_data);
 4730-
 4731-		STBIR_PROFILE_BUILD_START(cleanup);
 4732-		stbir__cleanup_gathered_coefficients(
 4733-		    samp->edge, &samp->extent_info, &samp->scale_info,
 4734-		    gather_num_contributors, gather_contributors, gather_coeffs,
 4735-		    gather_coefficient_width);
 4736-		STBIR_PROFILE_BUILD_END(cleanup);
 4737-	} break;
 4738-
 4739-	case 0: // scatter downsample (only on vertical)
 4740-	case 2: // gather downsample
 4741-	{
 4742-		float in_pixels_radius = support(scale, user_data) * inv_scale;
 4743-		int filter_pixel_margin = samp->filter_pixel_margin;
 4744-		int input_end = input_full_size + filter_pixel_margin;
 4745-
 4746-		// if this is a scatter, we do a downsample gather to get the coeffs,
 4747-		// and then pivot after
 4748-		if (!samp->is_gather) {
 4749-			// check if we are using the same gather downsample on the
 4750-			// horizontal as this vertical,
 4751-			//   if so, then we don't have to generate them, we can just pivot
 4752-			//   from the horizontal.
 4753-			if (other_axis_for_pivot) {
 4754-				gather_contributors = other_axis_for_pivot->contributors;
 4755-				gather_coeffs = other_axis_for_pivot->coefficients;
 4756-				gather_coefficient_width =
 4757-				    other_axis_for_pivot->coefficient_width;
 4758-				gather_num_contributors =
 4759-				    other_axis_for_pivot->num_contributors;
 4760-				samp->extent_info.lowest =
 4761-				    other_axis_for_pivot->extent_info.lowest;
 4762-				samp->extent_info.highest =
 4763-				    other_axis_for_pivot->extent_info.highest;
 4764-				samp->extent_info.widest =
 4765-				    other_axis_for_pivot->extent_info.widest;
 4766-				goto jump_right_to_pivot;
 4767-			}
 4768-
 4769-			gather_contributors = samp->gather_prescatter_contributors;
 4770-			gather_coeffs = samp->gather_prescatter_coefficients;
 4771-			gather_coefficient_width =
 4772-			    samp->gather_prescatter_coefficient_width;
 4773-			gather_num_contributors = samp->gather_prescatter_num_contributors;
 4774-		}
 4775-
 4776-		stbir__calculate_coefficients_for_gather_downsample(
 4777-		    -filter_pixel_margin, input_end, in_pixels_radius, kernel,
 4778-		    &samp->scale_info, gather_coefficient_width,
 4779-		    gather_num_contributors, gather_contributors, gather_coeffs,
 4780-		    user_data);
 4781-
 4782-		STBIR_PROFILE_BUILD_START(cleanup);
 4783-		stbir__cleanup_gathered_coefficients(
 4784-		    samp->edge, &samp->extent_info, &samp->scale_info,
 4785-		    gather_num_contributors, gather_contributors, gather_coeffs,
 4786-		    gather_coefficient_width);
 4787-		STBIR_PROFILE_BUILD_END(cleanup);
 4788-
 4789-		if (!samp->is_gather) {
 4790-			// if this is a scatter (vertical only), then we need to pivot the
 4791-			// coeffs
 4792-			stbir__contributors *scatter_contributors;
 4793-			int highest_set;
 4794-
 4795-		jump_right_to_pivot:
 4796-
 4797-			STBIR_PROFILE_BUILD_START(pivot);
 4798-
 4799-			highest_set = (-filter_pixel_margin) - 1;
 4800-			for (n = 0; n < gather_num_contributors; n++) {
 4801-				int k;
 4802-				int gn0 = gather_contributors->n0,
 4803-				    gn1 = gather_contributors->n1;
 4804-				int scatter_coefficient_width = samp->coefficient_width;
 4805-				float *scatter_coeffs =
 4806-				    samp->coefficients +
 4807-				    (gn0 + filter_pixel_margin) * scatter_coefficient_width;
 4808-				float *g_coeffs = gather_coeffs;
 4809-				scatter_contributors =
 4810-				    samp->contributors + (gn0 + filter_pixel_margin);
 4811-
 4812-				for (k = gn0; k <= gn1; k++) {
 4813-					float gc = *g_coeffs++;
 4814-
 4815-					// skip zero and denormals - must skip zeros to avoid adding
 4816-					// coeffs beyond scatter_coefficient_width
 4817-					//   (which happens when pivoting from horizontal, which
 4818-					//   might have dummy zeros)
 4819-					if (((gc >= stbir__small_float) ||
 4820-					     (gc <= -stbir__small_float))) {
 4821-						if ((k > highest_set) || (scatter_contributors->n0 >
 4822-						                          scatter_contributors->n1)) {
 4823-							{
 4824-								// if we are skipping over several contributors,
 4825-								// we need to clear the skipped ones
 4826-								stbir__contributors *clear_contributors =
 4827-								    samp->contributors +
 4828-								    (highest_set + filter_pixel_margin + 1);
 4829-								while (clear_contributors <
 4830-								       scatter_contributors) {
 4831-									clear_contributors->n0 = 0;
 4832-									clear_contributors->n1 = -1;
 4833-									++clear_contributors;
 4834-								}
 4835-							}
 4836-							scatter_contributors->n0 = n;
 4837-							scatter_contributors->n1 = n;
 4838-							scatter_coeffs[0] = gc;
 4839-							highest_set = k;
 4840-						} else {
 4841-							stbir__insert_coeff(scatter_contributors,
 4842-							                    scatter_coeffs, n, gc,
 4843-							                    scatter_coefficient_width);
 4844-						}
 4845-						STBIR_ASSERT((scatter_contributors->n1 -
 4846-						              scatter_contributors->n0 + 1) <=
 4847-						             scatter_coefficient_width);
 4848-					}
 4849-					++scatter_contributors;
 4850-					scatter_coeffs += scatter_coefficient_width;
 4851-				}
 4852-
 4853-				++gather_contributors;
 4854-				gather_coeffs += gather_coefficient_width;
 4855-			}
 4856-
 4857-			// now clear any unset contribs
 4858-			{
 4859-				stbir__contributors *clear_contributors =
 4860-				    samp->contributors +
 4861-				    (highest_set + filter_pixel_margin + 1);
 4862-				stbir__contributors *end_contributors =
 4863-				    samp->contributors + samp->num_contributors;
 4864-				while (clear_contributors < end_contributors) {
 4865-					clear_contributors->n0 = 0;
 4866-					clear_contributors->n1 = -1;
 4867-					++clear_contributors;
 4868-				}
 4869-			}
 4870-
 4871-			STBIR_PROFILE_BUILD_END(pivot);
 4872-		}
 4873-	} break;
 4874-	}
 4875-}
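The pivot step above turns gather coefficients (for each output pixel, which input pixels feed it) into scatter coefficients (for each input pixel, which output pixels it feeds). The following is a simplified sketch of that transposition, under assumptions that differ from the removed code: there is no filter pixel margin, only exact zeros are skipped (the original also skips denormal-sized weights), and gaps are zero-filled inline rather than through `stbir__insert_coeff`. The type `span_t` and the function `pivot_to_scatter` are hypothetical names.

```c
#include <assert.h>

/* inclusive index range; n1 < n0 means "empty" */
typedef struct { int n0, n1; } span_t;

/* Hypothetical sketch (not from stb_image_resize2).
   gather[n] lists the input range feeding output n, with weights stored at
   gather_coeffs[n * gather_width + (k - gather[n].n0)].
   On return, scatter[k] lists the output range fed by input k, with weights
   stored at scatter_coeffs[k * scatter_width + (n - scatter[k].n0)]. */
static void pivot_to_scatter(const span_t *gather, const float *gather_coeffs,
                             int gather_width, int num_outputs,
                             span_t *scatter, float *scatter_coeffs,
                             int scatter_width, int num_inputs)
{
    int n, k, j;

    /* start with every input contributing to nothing */
    for (k = 0; k < num_inputs; k++) {
        scatter[k].n0 = 0;
        scatter[k].n1 = -1;
    }

    for (n = 0; n < num_outputs; n++) {
        const float *gc = gather_coeffs + n * gather_width;
        for (k = gather[n].n0; k <= gather[n].n1; k++) {
            float c = gc[k - gather[n].n0];
            float *sc;

            if (c == 0.0f) {
                continue; /* padding zeros must not widen the scatter range */
            }
            assert(k >= 0 && k < num_inputs);
            sc = scatter_coeffs + k * scatter_width;

            if (scatter[k].n1 < scatter[k].n0) {
                /* first output fed by this input */
                scatter[k].n0 = n;
                scatter[k].n1 = n - 1;
            }
            assert(n - scatter[k].n0 < scatter_width);

            /* outputs skipped since the last write get an explicit zero */
            for (j = scatter[k].n1 + 1; j < n; j++) {
                sc[j - scatter[k].n0] = 0.0f;
            }
            sc[n - scatter[k].n0] = c;
            scatter[k].n1 = n;
        }
    }
}
```

Because outputs are visited in increasing order, each input's scatter range only ever grows at its high end, which is why a single forward pass is enough in this sketch.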
 4876-
 4877-//========================================================================================================
 4878-// scanline decoders and encoders
 4879-
 4880-#define stbir__coder_min_num 1
 4881-#define STB_IMAGE_RESIZE_DO_CODERS
 4882-#include STBIR__HEADER_FILENAME
 4883-
 4884-#define stbir__decode_suffix BGRA
 4885-#define stbir__decode_swizzle
 4886-#define stbir__decode_order0 2
 4887-#define stbir__decode_order1 1
 4888-#define stbir__decode_order2 0
 4889-#define stbir__decode_order3 3
 4890-#define stbir__encode_order0 2
 4891-#define stbir__encode_order1 1
 4892-#define stbir__encode_order2 0
 4893-#define stbir__encode_order3 3
 4894-#define stbir__coder_min_num 4
 4895-#define STB_IMAGE_RESIZE_DO_CODERS
 4896-#include STBIR__HEADER_FILENAME
 4897-
 4898-#define stbir__decode_suffix ARGB
 4899-#define stbir__decode_swizzle
 4900-#define stbir__decode_order0 1
 4901-#define stbir__decode_order1 2
 4902-#define stbir__decode_order2 3
 4903-#define stbir__decode_order3 0
 4904-#define stbir__encode_order0 3
 4905-#define stbir__encode_order1 0
 4906-#define stbir__encode_order2 1
 4907-#define stbir__encode_order3 2
 4908-#define stbir__coder_min_num 4
 4909-#define STB_IMAGE_RESIZE_DO_CODERS
 4910-#include STBIR__HEADER_FILENAME
 4911-
 4912-#define stbir__decode_suffix ABGR
 4913-#define stbir__decode_swizzle
 4914-#define stbir__decode_order0 3
 4915-#define stbir__decode_order1 2
 4916-#define stbir__decode_order2 1
 4917-#define stbir__decode_order3 0
 4918-#define stbir__encode_order0 3
 4919-#define stbir__encode_order1 2
 4920-#define stbir__encode_order2 1
 4921-#define stbir__encode_order3 0
 4922-#define stbir__coder_min_num 4
 4923-#define STB_IMAGE_RESIZE_DO_CODERS
 4924-#include STBIR__HEADER_FILENAME
 4925-
 4926-#define stbir__decode_suffix AR
 4927-#define stbir__decode_swizzle
 4928-#define stbir__decode_order0 1
 4929-#define stbir__decode_order1 0
 4930-#define stbir__decode_order2 3
 4931-#define stbir__decode_order3 2
 4932-#define stbir__encode_order0 1
 4933-#define stbir__encode_order1 0
 4934-#define stbir__encode_order2 3
 4935-#define stbir__encode_order3 2
 4936-#define stbir__coder_min_num 2
 4937-#define STB_IMAGE_RESIZE_DO_CODERS
 4938-#include STBIR__HEADER_FILENAME
 4939-
 4940-// fancy alpha means we expand to keep both premultiplied and non-premultiplied
 4941-// color channels
 4942-static void
 4943-stbir__fancy_alpha_weight_4ch(float *out_buffer, int width_times_channels)
 4944-{
 4945-	float STBIR_STREAMOUT_PTR(*) out = out_buffer;
 4946-	float const *end_decode =
 4947-	    out_buffer + (width_times_channels / 4) *
 4948-	                     7; // decode buffer aligned to end of out_buffer
 4949-	float STBIR_STREAMOUT_PTR(*) decode =
 4950-	    (float *)end_decode - width_times_channels;
 4951-
 4952-	// fancy alpha is stored internally as R G B A Rpm Gpm Bpm
 4953-
 4954-#ifdef STBIR_SIMD
 4955-
 4956-#ifdef STBIR_SIMD8
 4957-	decode += 16;
 4958-	STBIR_NO_UNROLL_LOOP_START
 4959-	while (decode <= end_decode) {
 4960-		stbir__simdf8 d0, d1, a0, a1, p0, p1;
 4961-		STBIR_NO_UNROLL(decode);
 4962-		stbir__simdf8_load(d0, decode - 16);
 4963-		stbir__simdf8_load(d1, decode - 16 + 8);
 4964-		stbir__simdf8_0123to33333333(a0, d0);
 4965-		stbir__simdf8_0123to33333333(a1, d1);
 4966-		stbir__simdf8_mult(p0, a0, d0);
 4967-		stbir__simdf8_mult(p1, a1, d1);
 4968-		stbir__simdf8_bot4s(a0, d0, p0);
 4969-		stbir__simdf8_bot4s(a1, d1, p1);
 4970-		stbir__simdf8_top4s(d0, d0, p0);
 4971-		stbir__simdf8_top4s(d1, d1, p1);
 4972-		stbir__simdf8_store(out, a0);
 4973-		stbir__simdf8_store(out + 7, d0);
 4974-		stbir__simdf8_store(out + 14, a1);
 4975-		stbir__simdf8_store(out + 21, d1);
 4976-		decode += 16;
 4977-		out += 28;
 4978-	}
 4979-	decode -= 16;
 4980-#else
 4981-	decode += 8;
 4982-	STBIR_NO_UNROLL_LOOP_START
 4983-	while (decode <= end_decode) {
 4984-		stbir__simdf d0, a0, d1, a1, p0, p1;
 4985-		STBIR_NO_UNROLL(decode);
 4986-		stbir__simdf_load(d0, decode - 8);
 4987-		stbir__simdf_load(d1, decode - 8 + 4);
 4988-		stbir__simdf_0123to3333(a0, d0);
 4989-		stbir__simdf_0123to3333(a1, d1);
 4990-		stbir__simdf_mult(p0, a0, d0);
 4991-		stbir__simdf_mult(p1, a1, d1);
 4992-		stbir__simdf_store(out, d0);
 4993-		stbir__simdf_store(out + 4, p0);
 4994-		stbir__simdf_store(out + 7, d1);
 4995-		stbir__simdf_store(out + 7 + 4, p1);
 4996-		decode += 8;
 4997-		out += 14;
 4998-	}
 4999-	decode -= 8;
 5000-#endif
 5001-
 5002-// might be one last odd pixel
 5003-#ifdef STBIR_SIMD8
 5004-	STBIR_NO_UNROLL_LOOP_START
 5005-	while (decode < end_decode)
 5006-#else
 5007-	if (decode < end_decode)
 5008-#endif
 5009-	{
 5010-		stbir__simdf d, a, p;
 5011-		STBIR_NO_UNROLL(decode);
 5012-		stbir__simdf_load(d, decode);
 5013-		stbir__simdf_0123to3333(a, d);
 5014-		stbir__simdf_mult(p, a, d);
 5015-		stbir__simdf_store(out, d);
 5016-		stbir__simdf_store(out + 4, p);
 5017-		decode += 4;
 5018-		out += 7;
 5019-	}
 5020-
 5021-#else
 5022-
 5023-	while (decode < end_decode) {
 5024-		float r = decode[0], g = decode[1], b = decode[2], alpha = decode[3];
 5025-		out[0] = r;
 5026-		out[1] = g;
 5027-		out[2] = b;
 5028-		out[3] = alpha;
 5029-		out[4] = r * alpha;
 5030-		out[5] = g * alpha;
 5031-		out[6] = b * alpha;
 5032-		out += 7;
 5033-		decode += 4;
 5034-	}
 5035-
 5036-#endif
 5037-}
 5038-
 5039-static void
 5040-stbir__fancy_alpha_weight_2ch(float *out_buffer, int width_times_channels)
 5041-{
 5042-	float STBIR_STREAMOUT_PTR(*) out = out_buffer;
 5043-	float const *end_decode = out_buffer + (width_times_channels / 2) * 3;
 5044-	float STBIR_STREAMOUT_PTR(*) decode =
 5045-	    (float *)end_decode - width_times_channels;
 5046-
 5047-	//  for fancy alpha, turns into: [X A Xpm][X A Xpm],etc
 5048-
 5049-#ifdef STBIR_SIMD
 5050-
 5051-	decode += 8;
 5052-	if (decode <= end_decode) {
 5053-		STBIR_NO_UNROLL_LOOP_START
 5054-		do {
 5055-#ifdef STBIR_SIMD8
 5056-			stbir__simdf8 d0, a0, p0;
 5057-			STBIR_NO_UNROLL(decode);
 5058-			stbir__simdf8_load(d0, decode - 8);
 5059-			stbir__simdf8_0123to11331133(p0, d0);
 5060-			stbir__simdf8_0123to00220022(a0, d0);
 5061-			stbir__simdf8_mult(p0, p0, a0);
 5062-
 5063-			stbir__simdf_store2(out, stbir__if_simdf8_cast_to_simdf4(d0));
 5064-			stbir__simdf_store(out + 2, stbir__if_simdf8_cast_to_simdf4(p0));
 5065-			stbir__simdf_store2h(out + 3, stbir__if_simdf8_cast_to_simdf4(d0));
 5066-
 5067-			stbir__simdf_store2(out + 6, stbir__simdf8_gettop4(d0));
 5068-			stbir__simdf_store(out + 8, stbir__simdf8_gettop4(p0));
 5069-			stbir__simdf_store2h(out + 9, stbir__simdf8_gettop4(d0));
 5070-#else
 5071-			stbir__simdf d0, a0, d1, a1, p0, p1;
 5072-			STBIR_NO_UNROLL(decode);
 5073-			stbir__simdf_load(d0, decode - 8);
 5074-			stbir__simdf_load(d1, decode - 8 + 4);
 5075-			stbir__simdf_0123to1133(p0, d0);
 5076-			stbir__simdf_0123to1133(p1, d1);
 5077-			stbir__simdf_0123to0022(a0, d0);
 5078-			stbir__simdf_0123to0022(a1, d1);
 5079-			stbir__simdf_mult(p0, p0, a0);
 5080-			stbir__simdf_mult(p1, p1, a1);
 5081-
 5082-			stbir__simdf_store2(out, d0);
 5083-			stbir__simdf_store(out + 2, p0);
 5084-			stbir__simdf_store2h(out + 3, d0);
 5085-
 5086-			stbir__simdf_store2(out + 6, d1);
 5087-			stbir__simdf_store(out + 8, p1);
 5088-			stbir__simdf_store2h(out + 9, d1);
 5089-#endif
 5090-			decode += 8;
 5091-			out += 12;
 5092-		} while (decode <= end_decode);
 5093-	}
 5094-	decode -= 8;
 5095-#endif
 5096-
 5097-	STBIR_SIMD_NO_UNROLL_LOOP_START
 5098-	while (decode < end_decode) {
 5099-		float x = decode[0], y = decode[1];
 5100-		STBIR_SIMD_NO_UNROLL(decode);
 5101-		out[0] = x;
 5102-		out[1] = y;
 5103-		out[2] = x * y;
 5104-		out += 3;
 5105-		decode += 2;
 5106-	}
 5107-}
 5108-
 5109-static void
 5110-stbir__fancy_alpha_unweight_4ch(float *encode_buffer, int width_times_channels)
 5111-{
 5112-	float STBIR_SIMD_STREAMOUT_PTR(*) encode = encode_buffer;
 5113-	float STBIR_SIMD_STREAMOUT_PTR(*) input = encode_buffer;
 5114-	float const *end_output = encode_buffer + width_times_channels;
 5115-
 5116-	// fancy RGBA is stored internally as R G B A Rpm Gpm Bpm
 5117-
 5118-	STBIR_SIMD_NO_UNROLL_LOOP_START
 5119-	do {
 5120-		float alpha = input[3];
 5121-#ifdef STBIR_SIMD
 5122-		stbir__simdf i, ia;
 5123-		STBIR_SIMD_NO_UNROLL(encode);
 5124-		if (alpha < stbir__small_float) {
 5125-			stbir__simdf_load(i, input);
 5126-			stbir__simdf_store(encode, i);
 5127-		} else {
 5128-			stbir__simdf_load1frep4(ia, 1.0f / alpha);
 5129-			stbir__simdf_load(i, input + 4);
 5130-			stbir__simdf_mult(i, i, ia);
 5131-			stbir__simdf_store(encode, i);
 5132-			encode[3] = alpha;
 5133-		}
 5134-#else
 5135-		if (alpha < stbir__small_float) {
 5136-			encode[0] = input[0];
 5137-			encode[1] = input[1];
 5138-			encode[2] = input[2];
 5139-		} else {
 5140-			float ialpha = 1.0f / alpha;
 5141-			encode[0] = input[4] * ialpha;
 5142-			encode[1] = input[5] * ialpha;
 5143-			encode[2] = input[6] * ialpha;
 5144-		}
 5145-		encode[3] = alpha;
 5146-#endif
 5147-
 5148-		input += 7;
 5149-		encode += 4;
 5150-	} while (encode < end_output);
 5151-}
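The two 4-channel "fancy alpha" routines above work in place on one buffer: weighting expands each RGBA pixel (stored right-aligned at the end of the buffer) into seven floats `R G B A Rpm Gpm Bpm`, and unweighting compacts it back to four. A scalar sketch of that round trip is given here, modeled on the non-SIMD branches. The function names and the `tiny` threshold parameter are hypothetical, standing in for `stbir__small_float`.

```c
#include <assert.h>

/* Hypothetical sketch (not from stb_image_resize2).
   buf holds num_pixels * 7 floats; the RGBA input occupies the LAST
   num_pixels * 4 floats. Expands in place to R G B A Rpm Gpm Bpm. */
static void weight_rgba_inplace(float *buf, int num_pixels)
{
    float *out = buf;
    const float *end = buf + num_pixels * 7;
    const float *in = end - num_pixels * 4;

    assert(num_pixels >= 0);
    while (in < end) {
        /* read the whole pixel first: the 7-float write may overlap it */
        float r = in[0], g = in[1], b = in[2], a = in[3];
        out[0] = r;
        out[1] = g;
        out[2] = b;
        out[3] = a;
        out[4] = r * a;
        out[5] = g * a;
        out[6] = b * a;
        out += 7;
        in += 4;
    }
}

/* Compacts 7-float pixels back to RGBA at the START of buf.
   When alpha is below 'tiny', the stored non-premultiplied color is kept
   instead of dividing by a near-zero alpha. */
static void unweight_rgba_inplace(float *buf, int num_pixels, float tiny)
{
    float *out = buf;
    const float *in = buf;
    int i;

    for (i = 0; i < num_pixels; i++) {
        float a = in[3];
        if (a < tiny) {
            out[0] = in[0];
            out[1] = in[1];
            out[2] = in[2];
        } else {
            float ia = 1.0f / a;
            out[0] = in[4] * ia;
            out[1] = in[5] * ia;
            out[2] = in[6] * ia;
        }
        out[3] = a;
        in += 7;
        out += 4;
    }
}
```

Keeping the non-premultiplied color alongside the premultiplied one is what lets a fully transparent pixel come back with its original color rather than black.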
 5152-
 5153-//  format: [X A Xpm][X A Xpm] etc
 5154-static void
 5155-stbir__fancy_alpha_unweight_2ch(float *encode_buffer, int width_times_channels)
 5156-{
 5157-	float STBIR_SIMD_STREAMOUT_PTR(*) encode = encode_buffer;
 5158-	float STBIR_SIMD_STREAMOUT_PTR(*) input = encode_buffer;
 5159-	float const *end_output = encode_buffer + width_times_channels;
 5160-
 5161-	do {
 5162-		float alpha = input[1];
 5163-		encode[0] = input[0];
 5164-		if (alpha >= stbir__small_float) {
 5165-			encode[0] = input[2] / alpha;
 5166-		}
 5167-		encode[1] = alpha;
 5168-
 5169-		input += 3;
 5170-		encode += 2;
 5171-	} while (encode < end_output);
 5172-}
 5173-
 5174-static void
 5175-stbir__simple_alpha_weight_4ch(float *decode_buffer, int width_times_channels)
 5176-{
 5177-	float STBIR_STREAMOUT_PTR(*) decode = decode_buffer;
 5178-	float const *end_decode = decode_buffer + width_times_channels;
 5179-
 5180-#ifdef STBIR_SIMD
 5181-	{
 5182-		decode += 2 * stbir__simdfX_float_count;
 5183-		STBIR_NO_UNROLL_LOOP_START
 5184-		while (decode <= end_decode) {
 5185-			stbir__simdfX d0, a0, d1, a1;
 5186-			STBIR_NO_UNROLL(decode);
 5187-			stbir__simdfX_load(d0, decode - 2 * stbir__simdfX_float_count);
 5188-			stbir__simdfX_load(d1, decode - 2 * stbir__simdfX_float_count +
 5189-			                           stbir__simdfX_float_count);
 5190-			stbir__simdfX_aaa1(a0, d0, STBIR_onesX);
 5191-			stbir__simdfX_aaa1(a1, d1, STBIR_onesX);
 5192-			stbir__simdfX_mult(d0, d0, a0);
 5193-			stbir__simdfX_mult(d1, d1, a1);
 5194-			stbir__simdfX_store(decode - 2 * stbir__simdfX_float_count, d0);
 5195-			stbir__simdfX_store(decode - 2 * stbir__simdfX_float_count +
 5196-			                        stbir__simdfX_float_count,
 5197-			                    d1);
 5198-			decode += 2 * stbir__simdfX_float_count;
 5199-		}
 5200-		decode -= 2 * stbir__simdfX_float_count;
 5201-
 5202-// a few last pixel remnants
 5203-#ifdef STBIR_SIMD8
 5204-		STBIR_NO_UNROLL_LOOP_START
 5205-		while (decode < end_decode)
 5206-#else
 5207-		if (decode < end_decode)
 5208-#endif
 5209-		{
 5210-			stbir__simdf d, a;
 5211-			stbir__simdf_load(d, decode);
 5212-			stbir__simdf_aaa1(a, d, STBIR__CONSTF(STBIR_ones));
 5213-			stbir__simdf_mult(d, d, a);
 5214-			stbir__simdf_store(decode, d);
 5215-			decode += 4;
 5216-		}
 5217-	}
 5218-
 5219-#else
 5220-
 5221-	while (decode < end_decode) {
 5222-		float alpha = decode[3];
 5223-		decode[0] *= alpha;
 5224-		decode[1] *= alpha;
 5225-		decode[2] *= alpha;
 5226-		decode += 4;
 5227-	}
 5228-
 5229-#endif
 5230-}
 5231-
 5232-static void
 5233-stbir__simple_alpha_weight_2ch(float *decode_buffer, int width_times_channels)
 5234-{
 5235-	float STBIR_STREAMOUT_PTR(*) decode = decode_buffer;
 5236-	float const *end_decode = decode_buffer + width_times_channels;
 5237-
 5238-#ifdef STBIR_SIMD
 5239-	decode += 2 * stbir__simdfX_float_count;
 5240-	STBIR_NO_UNROLL_LOOP_START
 5241-	while (decode <= end_decode) {
 5242-		stbir__simdfX d0, a0, d1, a1;
 5243-		STBIR_NO_UNROLL(decode);
 5244-		stbir__simdfX_load(d0, decode - 2 * stbir__simdfX_float_count);
 5245-		stbir__simdfX_load(d1, decode - 2 * stbir__simdfX_float_count +
 5246-		                           stbir__simdfX_float_count);
 5247-		stbir__simdfX_a1a1(a0, d0, STBIR_onesX);
 5248-		stbir__simdfX_a1a1(a1, d1, STBIR_onesX);
 5249-		stbir__simdfX_mult(d0, d0, a0);
 5250-		stbir__simdfX_mult(d1, d1, a1);
 5251-		stbir__simdfX_store(decode - 2 * stbir__simdfX_float_count, d0);
 5252-		stbir__simdfX_store(decode - 2 * stbir__simdfX_float_count +
 5253-		                        stbir__simdfX_float_count,
 5254-		                    d1);
 5255-		decode += 2 * stbir__simdfX_float_count;
 5256-	}
 5257-	decode -= 2 * stbir__simdfX_float_count;
 5258-#endif
 5259-
 5260-	STBIR_SIMD_NO_UNROLL_LOOP_START
 5261-	while (decode < end_decode) {
 5262-		float alpha = decode[1];
 5263-		STBIR_SIMD_NO_UNROLL(decode);
 5264-		decode[0] *= alpha;
 5265-		decode += 2;
 5266-	}
 5267-}
 5268-
 5269-static void
 5270-stbir__simple_alpha_unweight_4ch(float *encode_buffer, int width_times_channels)
 5271-{
 5272-	float STBIR_SIMD_STREAMOUT_PTR(*) encode = encode_buffer;
 5273-	float const *end_output = encode_buffer + width_times_channels;
 5274-
 5275-	STBIR_SIMD_NO_UNROLL_LOOP_START
 5276-	do {
 5277-		float alpha = encode[3];
 5278-
 5279-#ifdef STBIR_SIMD
 5280-		stbir__simdf i, ia;
 5281-		STBIR_SIMD_NO_UNROLL(encode);
 5282-		if (alpha >= stbir__small_float) {
 5283-			stbir__simdf_load1frep4(ia, 1.0f / alpha);
 5284-			stbir__simdf_load(i, encode);
 5285-			stbir__simdf_mult(i, i, ia);
 5286-			stbir__simdf_store(encode, i);
 5287-			encode[3] = alpha;
 5288-		}
 5289-#else
 5290-		if (alpha >= stbir__small_float) {
 5291-			float ialpha = 1.0f / alpha;
 5292-			encode[0] *= ialpha;
 5293-			encode[1] *= ialpha;
 5294-			encode[2] *= ialpha;
 5295-		}
 5296-#endif
 5297-		encode += 4;
 5298-	} while (encode < end_output);
 5299-}
 5300-
 5301-static void
 5302-stbir__simple_alpha_unweight_2ch(float *encode_buffer, int width_times_channels)
 5303-{
 5304-	float STBIR_SIMD_STREAMOUT_PTR(*) encode = encode_buffer;
 5305-	float const *end_output = encode_buffer + width_times_channels;
 5306-
 5307-	do {
 5308-		float alpha = encode[1];
 5309-		if (alpha >= stbir__small_float) {
 5310-			encode[0] /= alpha;
 5311-		}
 5312-		encode += 2;
 5313-	} while (encode < end_output);
 5314-}
 5315-
 5316-// only used in RGB->BGR or BGR->RGB
 5317-static void
 5318-stbir__simple_flip_3ch(float *decode_buffer, int width_times_channels)
 5319-{
 5320-	float STBIR_STREAMOUT_PTR(*) decode = decode_buffer;
 5321-	float const *end_decode = decode_buffer + width_times_channels;
 5322-
 5323-#ifdef STBIR_SIMD
 5324-#ifdef stbir__simdf_swiz2 // do we have two argument swizzles?
 5325-	end_decode -= 12;
 5326-	STBIR_NO_UNROLL_LOOP_START
 5327-	while (decode <= end_decode) {
 5328-		// on arm64 8 instructions, no overlapping stores
 5329-		stbir__simdf a, b, c, na, nb;
 5330-		STBIR_SIMD_NO_UNROLL(decode);
 5331-		stbir__simdf_load(a, decode);
 5332-		stbir__simdf_load(b, decode + 4);
 5333-		stbir__simdf_load(c, decode + 8);
 5334-
 5335-		na = stbir__simdf_swiz2(a, b, 2, 1, 0, 5);
 5336-		b = stbir__simdf_swiz2(a, b, 4, 3, 6, 7);
 5337-		nb = stbir__simdf_swiz2(b, c, 0, 1, 4, 3);
 5338-		c = stbir__simdf_swiz2(b, c, 2, 7, 6, 5);
 5339-
 5340-		stbir__simdf_store(decode, na);
 5341-		stbir__simdf_store(decode + 4, nb);
 5342-		stbir__simdf_store(decode + 8, c);
 5343-		decode += 12;
 5344-	}
 5345-	end_decode += 12;
 5346-#else
 5347-	end_decode -= 24;
 5348-	STBIR_NO_UNROLL_LOOP_START
 5349-	while (decode <= end_decode) {
 5350-		// 26 instructions on x64
 5351-		stbir__simdf a, b, c, d, e, f, g;
 5352-		float i21, i23;
 5353-		STBIR_SIMD_NO_UNROLL(decode);
 5354-		stbir__simdf_load(a, decode);
 5355-		stbir__simdf_load(b, decode + 3);
 5356-		stbir__simdf_load(c, decode + 6);
 5357-		stbir__simdf_load(d, decode + 9);
 5358-		stbir__simdf_load(e, decode + 12);
 5359-		stbir__simdf_load(f, decode + 15);
 5360-		stbir__simdf_load(g, decode + 18);
 5361-
 5362-		a = stbir__simdf_swiz(a, 2, 1, 0, 3);
 5363-		b = stbir__simdf_swiz(b, 2, 1, 0, 3);
 5364-		c = stbir__simdf_swiz(c, 2, 1, 0, 3);
 5365-		d = stbir__simdf_swiz(d, 2, 1, 0, 3);
 5366-		e = stbir__simdf_swiz(e, 2, 1, 0, 3);
 5367-		f = stbir__simdf_swiz(f, 2, 1, 0, 3);
 5368-		g = stbir__simdf_swiz(g, 2, 1, 0, 3);
 5369-
 5370-		// stores overlap, so they need to be in order
 5371-		stbir__simdf_store(decode, a);
 5372-		i21 = decode[21];
 5373-		stbir__simdf_store(decode + 3, b);
 5374-		i23 = decode[23];
 5375-		stbir__simdf_store(decode + 6, c);
 5376-		stbir__simdf_store(decode + 9, d);
 5377-		stbir__simdf_store(decode + 12, e);
 5378-		stbir__simdf_store(decode + 15, f);
 5379-		stbir__simdf_store(decode + 18, g);
 5380-		decode[21] = i23;
 5381-		decode[23] = i21;
 5382-		decode += 24;
 5383-	}
 5384-	end_decode += 24;
 5385-#endif
 5386-#else
 5387-	end_decode -= 12;
 5388-	STBIR_NO_UNROLL_LOOP_START
 5389-	while (decode <= end_decode) {
 5390-		// 16 instructions
 5391-		float t0, t1, t2, t3;
 5392-		STBIR_NO_UNROLL(decode);
 5393-		t0 = decode[0];
 5394-		t1 = decode[3];
 5395-		t2 = decode[6];
 5396-		t3 = decode[9];
 5397-		decode[0] = decode[2];
 5398-		decode[3] = decode[5];
 5399-		decode[6] = decode[8];
 5400-		decode[9] = decode[11];
 5401-		decode[2] = t0;
 5402-		decode[5] = t1;
 5403-		decode[8] = t2;
 5404-		decode[11] = t3;
 5405-		decode += 12;
 5406-	}
 5407-	end_decode += 12;
 5408-#endif
 5409-
 5410-	STBIR_NO_UNROLL_LOOP_START
 5411-	while (decode < end_decode) {
 5412-		float t = decode[0];
 5413-		STBIR_NO_UNROLL(decode);
 5414-		decode[0] = decode[2];
 5415-		decode[2] = t;
 5416-		decode += 3;
 5417-	}
 5418-}
 5419-
 5420-static void
 5421-stbir__decode_scanline(stbir__info const *stbir_info, int n,
 5422-                       float *output_buffer STBIR_ONLY_PROFILE_GET_SPLIT_INFO)
 5423-{
 5424-	int channels = stbir_info->channels;
 5425-	int effective_channels = stbir_info->effective_channels;
 5426-	int input_sample_in_bytes =
 5427-	    stbir__type_size[stbir_info->input_type] * channels;
 5428-	stbir_edge edge_horizontal = stbir_info->horizontal.edge;
 5429-	stbir_edge edge_vertical = stbir_info->vertical.edge;
 5430-	int row = stbir__edge_wrap(edge_vertical, n,
 5431-	                           stbir_info->vertical.scale_info.input_full_size);
 5432-	const void *input_plane_data =
 5433-	    ((char *)stbir_info->input_data) +
 5434-	    (size_t)row * (size_t)stbir_info->input_stride_bytes;
 5435-	stbir__span const *spans = stbir_info->scanline_extents.spans;
 5436-	float *full_decode_buffer =
 5437-	    output_buffer -
 5438-	    stbir_info->scanline_extents.conservative.n0 * effective_channels;
 5439-	float *last_decoded = 0;
 5440-
 5441-	// if we are on edge_zero, and we get in here with an out of bounds n, then
 5442-	// the filter calculation has failed
 5443-	STBIR_ASSERT(
 5444-	    !(edge_vertical == STBIR_EDGE_ZERO &&
 5445-	      (n < 0 || n >= stbir_info->vertical.scale_info.input_full_size)));
 5446-
 5447-	do {
 5448-		float *decode_buffer;
 5449-		void const *input_data;
 5450-		float *end_decode;
 5451-		int width_times_channels;
 5452-		int width;
 5453-
 5454-		if (spans->n1 < spans->n0) {
 5455-			break;
 5456-		}
 5457-
 5458-		width = spans->n1 + 1 - spans->n0;
 5459-		decode_buffer = full_decode_buffer + spans->n0 * effective_channels;
 5460-		end_decode = full_decode_buffer + (spans->n1 + 1) * effective_channels;
 5461-		width_times_channels = width * channels;
 5462-
 5463-		// read directly out of input plane by default
 5464-		input_data = ((char *)input_plane_data) +
 5465-		             spans->pixel_offset_for_input * input_sample_in_bytes;
 5466-
 5467-		// if we have an input callback, call it to get the input data
 5468-		if (stbir_info->in_pixels_cb) {
 5469-			// call the callback with a temp buffer (that they can choose to use
 5470-			// or not).  the temp is just right aligned memory in the
 5471-			// decode_buffer itself
 5472-			input_data = stbir_info->in_pixels_cb(
 5473-			    ((char *)end_decode) - (width * input_sample_in_bytes) +
 5474-			        ((stbir_info->input_type != STBIR_TYPE_FLOAT)
 5475-			             ? (sizeof(float) * STBIR_INPUT_CALLBACK_PADDING)
 5476-			             : 0),
 5477-			    input_plane_data, width, spans->pixel_offset_for_input, row,
 5478-			    stbir_info->user_data);
 5479-		}
 5480-
 5481-		STBIR_PROFILE_START(decode);
 5482-		// convert the pixels into the float decode_buffer, (we index from
 5483-		// end_decode, so that when channels<effective_channels, we are right
 5484-		// justified in the buffer)
 5485-		last_decoded = stbir_info->decode_pixels(
 5486-		    (float *)end_decode - width_times_channels, width_times_channels,
 5487-		    input_data);
 5488-		STBIR_PROFILE_END(decode);
 5489-
 5490-		if (stbir_info->alpha_weight) {
 5491-			STBIR_PROFILE_START(alpha);
 5492-			stbir_info->alpha_weight(decode_buffer, width_times_channels);
 5493-			STBIR_PROFILE_END(alpha);
 5494-		}
 5495-
 5496-		++spans;
 5497-	} while (spans <= (&stbir_info->scanline_extents.spans[1]));
 5498-
 5499-	// handle the edge_wrap filter (all other types are handled back out at the
 5500-	// calculate_filter stage). the idea here is that if we have the whole
 5501-	// scanline in memory, we don't redecode the wrapped edge pixels, and
 5502-	// instead just memcpy them from the scanline into the edge
 5503-	// positions
 5504-	if ((edge_horizontal == STBIR_EDGE_WRAP) &&
 5505-	    (stbir_info->scanline_extents.edge_sizes[0] |
 5506-	     stbir_info->scanline_extents.edge_sizes[1])) {
 5507-		// this code only runs if we're in edge_wrap, and we're doing the entire
 5508-		// scanline
 5509-		int e, start_x[2];
 5510-		int input_full_size = stbir_info->horizontal.scale_info.input_full_size;
 5511-
 5512-		start_x[0] =
 5513-		    -stbir_info->scanline_extents.edge_sizes[0]; // left edge start x
 5514-		start_x[1] = input_full_size;                    // right edge
 5515-
 5516-		for (e = 0; e < 2; e++) {
 5517-			// do each margin
 5518-			int margin = stbir_info->scanline_extents.edge_sizes[e];
 5519-			if (margin) {
 5520-				int x = start_x[e];
 5521-				float *marg = full_decode_buffer + x * effective_channels;
 5522-				float const *src =
 5523-				    full_decode_buffer +
 5524-				    stbir__edge_wrap(edge_horizontal, x, input_full_size) *
 5525-				        effective_channels;
 5526-				STBIR_MEMCPY(marg, src,
 5527-				             margin * effective_channels * sizeof(float));
 5528-				if (e == 1) {
 5529-					last_decoded = marg + margin * effective_channels;
 5530-				}
 5531-			}
 5532-		}
 5533-	}
 5534-
 5535-	// some of the horizontal gathers read one float off the edge (which is
 5536-	// masked out), but we force a zero here to make sure no NaNs leak in
 5537-	//   (we can't pre-zero it, because the input callback can use that area as
 5538-	//   padding)
 5539-	last_decoded[0] = 0.0f;
 5540-
 5541-	// we clear this extra float, because the final output pixel filter kernel
 5542-	// might have used one less coeff than the max filter width
 5543-	//   when this happens, we do read that pixel from the input, so it too
 5544-	//   could be NaN, so just zero an extra one. this fits because each
 5545-	//   scanline is padded by three floats (STBIR_INPUT_CALLBACK_PADDING)
 5546-	last_decoded[1] = 0.0f;
 5547-}
 5548-
 5549-//=================
 5550-// Do 1 channel horizontal routines
 5551-
 5552-#ifdef STBIR_SIMD
 5553-
 5554-#define stbir__1_coeff_only()                                                  \
 5555-	stbir__simdf tot, c;                                                       \
 5556-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 5557-	stbir__simdf_load1(c, hc);                                                 \
 5558-	stbir__simdf_mult1_mem(tot, c, decode);
 5559-
 5560-#define stbir__2_coeff_only()                                                  \
 5561-	stbir__simdf tot, c, d;                                                    \
 5562-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 5563-	stbir__simdf_load2z(c, hc);                                                \
 5564-	stbir__simdf_load2(d, decode);                                             \
 5565-	stbir__simdf_mult(tot, c, d);                                              \
 5566-	stbir__simdf_0123to1230(c, tot);                                           \
 5567-	stbir__simdf_add1(tot, tot, c);
 5568-
 5569-#define stbir__3_coeff_only()                                                  \
 5570-	stbir__simdf tot, c, t;                                                    \
 5571-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 5572-	stbir__simdf_load(c, hc);                                                  \
 5573-	stbir__simdf_mult_mem(tot, c, decode);                                     \
 5574-	stbir__simdf_0123to1230(c, tot);                                           \
 5575-	stbir__simdf_0123to2301(t, tot);                                           \
 5576-	stbir__simdf_add1(tot, tot, c);                                            \
 5577-	stbir__simdf_add1(tot, tot, t);
 5578-
 5579-#define stbir__store_output_tiny()                                             \
 5580-	stbir__simdf_store1(output, tot);                                          \
 5581-	horizontal_coefficients += coefficient_width;                              \
 5582-	++horizontal_contributors;                                                 \
 5583-	output += 1;
 5584-
 5585-#define stbir__4_coeff_start()                                                 \
 5586-	stbir__simdf tot, c;                                                       \
 5587-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 5588-	stbir__simdf_load(c, hc);                                                  \
 5589-	stbir__simdf_mult_mem(tot, c, decode);
 5590-
 5591-#define stbir__4_coeff_continue_from_4(ofs)                                    \
 5592-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 5593-	stbir__simdf_load(c, hc + (ofs));                                          \
 5594-	stbir__simdf_madd_mem(tot, tot, c, decode + (ofs));
 5595-
 5596-#define stbir__1_coeff_remnant(ofs)                                            \
 5597-	{                                                                          \
 5598-		stbir__simdf d;                                                        \
 5599-		stbir__simdf_load1z(c, hc + (ofs));                                    \
 5600-		stbir__simdf_load1(d, decode + (ofs));                                 \
 5601-		stbir__simdf_madd(tot, tot, d, c);                                     \
 5602-	}
 5603-
 5604-#define stbir__2_coeff_remnant(ofs)                                            \
 5605-	{                                                                          \
 5606-		stbir__simdf d;                                                        \
 5607-		stbir__simdf_load2z(c, hc + (ofs));                                    \
 5608-		stbir__simdf_load2(d, decode + (ofs));                                 \
 5609-		stbir__simdf_madd(tot, tot, d, c);                                     \
 5610-	}
 5611-
 5612-#define stbir__3_coeff_setup()                                                 \
 5613-	stbir__simdf mask;                                                         \
 5614-	stbir__simdf_load(mask, STBIR_mask + 3);
 5615-
 5616-#define stbir__3_coeff_remnant(ofs)                                            \
 5617-	stbir__simdf_load(c, hc + (ofs));                                          \
 5618-	stbir__simdf_and(c, c, mask);                                              \
 5619-	stbir__simdf_madd_mem(tot, tot, c, decode + (ofs));
 5620-
 5621-#define stbir__store_output()                                                  \
 5622-	stbir__simdf_0123to2301(c, tot);                                           \
 5623-	stbir__simdf_add(tot, tot, c);                                             \
 5624-	stbir__simdf_0123to1230(c, tot);                                           \
 5625-	stbir__simdf_add1(tot, tot, c);                                            \
 5626-	stbir__simdf_store1(output, tot);                                          \
 5627-	horizontal_coefficients += coefficient_width;                              \
 5628-	++horizontal_contributors;                                                 \
 5629-	output += 1;
 5630-
 5631-#else
 5632-
 5633-#define stbir__1_coeff_only()                                                  \
 5634-	float tot;                                                                 \
 5635-	tot = decode[0] * hc[0];
 5636-
 5637-#define stbir__2_coeff_only()                                                  \
 5638-	float tot;                                                                 \
 5639-	tot = decode[0] * hc[0];                                                   \
 5640-	tot += decode[1] * hc[1];
 5641-
 5642-#define stbir__3_coeff_only()                                                  \
 5643-	float tot;                                                                 \
 5644-	tot = decode[0] * hc[0];                                                   \
 5645-	tot += decode[1] * hc[1];                                                  \
 5646-	tot += decode[2] * hc[2];
 5647-
 5648-#define stbir__store_output_tiny()                                             \
 5649-	output[0] = tot;                                                           \
 5650-	horizontal_coefficients += coefficient_width;                              \
 5651-	++horizontal_contributors;                                                 \
 5652-	output += 1;
 5653-
 5654-#define stbir__4_coeff_start()                                                 \
 5655-	float tot0, tot1, tot2, tot3;                                              \
 5656-	tot0 = decode[0] * hc[0];                                                  \
 5657-	tot1 = decode[1] * hc[1];                                                  \
 5658-	tot2 = decode[2] * hc[2];                                                  \
 5659-	tot3 = decode[3] * hc[3];
 5660-
 5661-#define stbir__4_coeff_continue_from_4(ofs)                                    \
 5662-	tot0 += decode[0 + (ofs)] * hc[0 + (ofs)];                                 \
 5663-	tot1 += decode[1 + (ofs)] * hc[1 + (ofs)];                                 \
 5664-	tot2 += decode[2 + (ofs)] * hc[2 + (ofs)];                                 \
 5665-	tot3 += decode[3 + (ofs)] * hc[3 + (ofs)];
 5666-
 5667-#define stbir__1_coeff_remnant(ofs) tot0 += decode[0 + (ofs)] * hc[0 + (ofs)];
 5668-
 5669-#define stbir__2_coeff_remnant(ofs)                                            \
 5670-	tot0 += decode[0 + (ofs)] * hc[0 + (ofs)];                                 \
 5671-	tot1 += decode[1 + (ofs)] * hc[1 + (ofs)];
 5672-
 5673-#define stbir__3_coeff_remnant(ofs)                                            \
 5674-	tot0 += decode[0 + (ofs)] * hc[0 + (ofs)];                                 \
 5675-	tot1 += decode[1 + (ofs)] * hc[1 + (ofs)];                                 \
 5676-	tot2 += decode[2 + (ofs)] * hc[2 + (ofs)];
 5677-
 5678-#define stbir__store_output()                                                  \
 5679-	output[0] = (tot0 + tot2) + (tot1 + tot3);                                 \
 5680-	horizontal_coefficients += coefficient_width;                              \
 5681-	++horizontal_contributors;                                                 \
 5682-	output += 1;
 5683-
 5684-#endif
 5685-
 5686-#define STBIR__horizontal_channels 1
 5687-#define STB_IMAGE_RESIZE_DO_HORIZONTALS
 5688-#include STBIR__HEADER_FILENAME
 5689-
 5690-//=================
 5691-// Do 2 channel horizontal routines
 5692-
 5693-#ifdef STBIR_SIMD
 5694-
 5695-#define stbir__1_coeff_only()                                                  \
 5696-	stbir__simdf tot, c, d;                                                    \
 5697-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 5698-	stbir__simdf_load1z(c, hc);                                                \
 5699-	stbir__simdf_0123to0011(c, c);                                             \
 5700-	stbir__simdf_load2(d, decode);                                             \
 5701-	stbir__simdf_mult(tot, d, c);
 5702-
 5703-#define stbir__2_coeff_only()                                                  \
 5704-	stbir__simdf tot, c;                                                       \
 5705-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 5706-	stbir__simdf_load2(c, hc);                                                 \
 5707-	stbir__simdf_0123to0011(c, c);                                             \
 5708-	stbir__simdf_mult_mem(tot, c, decode);
 5709-
 5710-#define stbir__3_coeff_only()                                                  \
 5711-	stbir__simdf tot, c, cs, d;                                                \
 5712-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 5713-	stbir__simdf_load(cs, hc);                                                 \
 5714-	stbir__simdf_0123to0011(c, cs);                                            \
 5715-	stbir__simdf_mult_mem(tot, c, decode);                                     \
 5716-	stbir__simdf_0123to2222(c, cs);                                            \
 5717-	stbir__simdf_load2z(d, decode + 4);                                        \
 5718-	stbir__simdf_madd(tot, tot, d, c);
 5719-
 5720-#define stbir__store_output_tiny()                                             \
 5721-	stbir__simdf_0123to2301(c, tot);                                           \
 5722-	stbir__simdf_add(tot, tot, c);                                             \
 5723-	stbir__simdf_store2(output, tot);                                          \
 5724-	horizontal_coefficients += coefficient_width;                              \
 5725-	++horizontal_contributors;                                                 \
 5726-	output += 2;
 5727-
 5728-#ifdef STBIR_SIMD8
 5729-
 5730-#define stbir__4_coeff_start()                                                 \
 5731-	stbir__simdf8 tot0, c, cs;                                                 \
 5732-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 5733-	stbir__simdf8_load4b(cs, hc);                                              \
 5734-	stbir__simdf8_0123to00112233(c, cs);                                       \
 5735-	stbir__simdf8_mult_mem(tot0, c, decode);
 5736-
 5737-#define stbir__4_coeff_continue_from_4(ofs)                                    \
 5738-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 5739-	stbir__simdf8_load4b(cs, hc + (ofs));                                      \
 5740-	stbir__simdf8_0123to00112233(c, cs);                                       \
 5741-	stbir__simdf8_madd_mem(tot0, tot0, c, decode + (ofs) * 2);
 5742-
 5743-#define stbir__1_coeff_remnant(ofs)                                            \
 5744-	{                                                                          \
 5745-		stbir__simdf t, d;                                                     \
 5746-		stbir__simdf_load1z(t, hc + (ofs));                                    \
 5747-		stbir__simdf_load2(d, decode + (ofs) * 2);                             \
 5748-		stbir__simdf_0123to0011(t, t);                                         \
 5749-		stbir__simdf_mult(t, t, d);                                            \
 5750-		stbir__simdf8_add4(tot0, tot0, t);                                     \
 5751-	}
 5752-
 5753-#define stbir__2_coeff_remnant(ofs)                                            \
 5754-	{                                                                          \
 5755-		stbir__simdf t;                                                        \
 5756-		stbir__simdf_load2(t, hc + (ofs));                                     \
 5757-		stbir__simdf_0123to0011(t, t);                                         \
 5758-		stbir__simdf_mult_mem(t, t, decode + (ofs) * 2);                       \
 5759-		stbir__simdf8_add4(tot0, tot0, t);                                     \
 5760-	}
 5761-
 5762-#define stbir__3_coeff_remnant(ofs)                                            \
 5763-	{                                                                          \
 5764-		stbir__simdf8 d;                                                       \
 5765-		stbir__simdf8_load4b(cs, hc + (ofs));                                  \
 5766-		stbir__simdf8_0123to00112233(c, cs);                                   \
 5767-		stbir__simdf8_load6z(d, decode + (ofs) * 2);                           \
 5768-		stbir__simdf8_madd(tot0, tot0, c, d);                                  \
 5769-	}
 5770-
 5771-#define stbir__store_output()                                                  \
 5772-	{                                                                          \
 5773-		stbir__simdf t, d;                                                     \
 5774-		stbir__simdf8_add4halves(t, stbir__if_simdf8_cast_to_simdf4(tot0),     \
 5775-		                         tot0);                                        \
 5776-		stbir__simdf_0123to2301(d, t);                                         \
 5777-		stbir__simdf_add(t, t, d);                                             \
 5778-		stbir__simdf_store2(output, t);                                        \
 5779-		horizontal_coefficients += coefficient_width;                          \
 5780-		++horizontal_contributors;                                             \
 5781-		output += 2;                                                           \
 5782-	}
 5783-
 5784-#else
 5785-
 5786-#define stbir__4_coeff_start()                                                 \
 5787-	stbir__simdf tot0, tot1, c, cs;                                            \
 5788-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 5789-	stbir__simdf_load(cs, hc);                                                 \
 5790-	stbir__simdf_0123to0011(c, cs);                                            \
 5791-	stbir__simdf_mult_mem(tot0, c, decode);                                    \
 5792-	stbir__simdf_0123to2233(c, cs);                                            \
 5793-	stbir__simdf_mult_mem(tot1, c, decode + 4);
 5794-
 5795-#define stbir__4_coeff_continue_from_4(ofs)                                    \
 5796-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 5797-	stbir__simdf_load(cs, hc + (ofs));                                         \
 5798-	stbir__simdf_0123to0011(c, cs);                                            \
 5799-	stbir__simdf_madd_mem(tot0, tot0, c, decode + (ofs) * 2);                  \
 5800-	stbir__simdf_0123to2233(c, cs);                                            \
 5801-	stbir__simdf_madd_mem(tot1, tot1, c, decode + (ofs) * 2 + 4);
 5802-
 5803-#define stbir__1_coeff_remnant(ofs)                                            \
 5804-	{                                                                          \
 5805-		stbir__simdf d;                                                        \
 5806-		stbir__simdf_load1z(cs, hc + (ofs));                                   \
 5807-		stbir__simdf_0123to0011(c, cs);                                        \
 5808-		stbir__simdf_load2(d, decode + (ofs) * 2);                             \
 5809-		stbir__simdf_madd(tot0, tot0, d, c);                                   \
 5810-	}
 5811-
 5812-#define stbir__2_coeff_remnant(ofs)                                            \
 5813-	stbir__simdf_load2(cs, hc + (ofs));                                        \
 5814-	stbir__simdf_0123to0011(c, cs);                                            \
 5815-	stbir__simdf_madd_mem(tot0, tot0, c, decode + (ofs) * 2);
 5816-
 5817-#define stbir__3_coeff_remnant(ofs)                                            \
 5818-	{                                                                          \
 5819-		stbir__simdf d;                                                        \
 5820-		stbir__simdf_load(cs, hc + (ofs));                                     \
 5821-		stbir__simdf_0123to0011(c, cs);                                        \
 5822-		stbir__simdf_madd_mem(tot0, tot0, c, decode + (ofs) * 2);              \
 5823-		stbir__simdf_0123to2222(c, cs);                                        \
 5824-		stbir__simdf_load2z(d, decode + (ofs) * 2 + 4);                        \
 5825-		stbir__simdf_madd(tot1, tot1, d, c);                                   \
 5826-	}
 5827-
 5828-#define stbir__store_output()                                                  \
 5829-	stbir__simdf_add(tot0, tot0, tot1);                                        \
 5830-	stbir__simdf_0123to2301(c, tot0);                                          \
 5831-	stbir__simdf_add(tot0, tot0, c);                                           \
 5832-	stbir__simdf_store2(output, tot0);                                         \
 5833-	horizontal_coefficients += coefficient_width;                              \
 5834-	++horizontal_contributors;                                                 \
 5835-	output += 2;
 5836-
 5837-#endif
 5838-
 5839-#else
 5840-
 5841-#define stbir__1_coeff_only()                                                  \
 5842-	float tota, totb, c;                                                       \
 5843-	c = hc[0];                                                                 \
 5844-	tota = decode[0] * c;                                                      \
 5845-	totb = decode[1] * c;
 5846-
 5847-#define stbir__2_coeff_only()                                                  \
 5848-	float tota, totb, c;                                                       \
 5849-	c = hc[0];                                                                 \
 5850-	tota = decode[0] * c;                                                      \
 5851-	totb = decode[1] * c;                                                      \
 5852-	c = hc[1];                                                                 \
 5853-	tota += decode[2] * c;                                                     \
 5854-	totb += decode[3] * c;
 5855-
 5856-// this weird order of add matches the simd
 5857-#define stbir__3_coeff_only()                                                  \
 5858-	float tota, totb, c;                                                       \
 5859-	c = hc[0];                                                                 \
 5860-	tota = decode[0] * c;                                                      \
 5861-	totb = decode[1] * c;                                                      \
 5862-	c = hc[2];                                                                 \
 5863-	tota += decode[4] * c;                                                     \
 5864-	totb += decode[5] * c;                                                     \
 5865-	c = hc[1];                                                                 \
 5866-	tota += decode[2] * c;                                                     \
 5867-	totb += decode[3] * c;
 5868-
 5869-#define stbir__store_output_tiny()                                             \
 5870-	output[0] = tota;                                                          \
 5871-	output[1] = totb;                                                          \
 5872-	horizontal_coefficients += coefficient_width;                              \
 5873-	++horizontal_contributors;                                                 \
 5874-	output += 2;
 5875-
 5876-#define stbir__4_coeff_start()                                                 \
 5877-	float tota0, tota1, tota2, tota3, totb0, totb1, totb2, totb3, c;           \
 5878-	c = hc[0];                                                                 \
 5879-	tota0 = decode[0] * c;                                                     \
 5880-	totb0 = decode[1] * c;                                                     \
 5881-	c = hc[1];                                                                 \
 5882-	tota1 = decode[2] * c;                                                     \
 5883-	totb1 = decode[3] * c;                                                     \
 5884-	c = hc[2];                                                                 \
 5885-	tota2 = decode[4] * c;                                                     \
 5886-	totb2 = decode[5] * c;                                                     \
 5887-	c = hc[3];                                                                 \
 5888-	tota3 = decode[6] * c;                                                     \
 5889-	totb3 = decode[7] * c;
 5890-
 5891-#define stbir__4_coeff_continue_from_4(ofs)                                    \
 5892-	c = hc[0 + (ofs)];                                                         \
 5893-	tota0 += decode[0 + (ofs) * 2] * c;                                        \
 5894-	totb0 += decode[1 + (ofs) * 2] * c;                                        \
 5895-	c = hc[1 + (ofs)];                                                         \
 5896-	tota1 += decode[2 + (ofs) * 2] * c;                                        \
 5897-	totb1 += decode[3 + (ofs) * 2] * c;                                        \
 5898-	c = hc[2 + (ofs)];                                                         \
 5899-	tota2 += decode[4 + (ofs) * 2] * c;                                        \
 5900-	totb2 += decode[5 + (ofs) * 2] * c;                                        \
 5901-	c = hc[3 + (ofs)];                                                         \
 5902-	tota3 += decode[6 + (ofs) * 2] * c;                                        \
 5903-	totb3 += decode[7 + (ofs) * 2] * c;
 5904-
 5905-#define stbir__1_coeff_remnant(ofs)                                            \
 5906-	c = hc[0 + (ofs)];                                                         \
 5907-	tota0 += decode[0 + (ofs) * 2] * c;                                        \
 5908-	totb0 += decode[1 + (ofs) * 2] * c;
 5909-
 5910-#define stbir__2_coeff_remnant(ofs)                                            \
 5911-	c = hc[0 + (ofs)];                                                         \
 5912-	tota0 += decode[0 + (ofs) * 2] * c;                                        \
 5913-	totb0 += decode[1 + (ofs) * 2] * c;                                        \
 5914-	c = hc[1 + (ofs)];                                                         \
 5915-	tota1 += decode[2 + (ofs) * 2] * c;                                        \
 5916-	totb1 += decode[3 + (ofs) * 2] * c;
 5917-
 5918-#define stbir__3_coeff_remnant(ofs)                                            \
 5919-	c = hc[0 + (ofs)];                                                         \
 5920-	tota0 += decode[0 + (ofs) * 2] * c;                                        \
 5921-	totb0 += decode[1 + (ofs) * 2] * c;                                        \
 5922-	c = hc[1 + (ofs)];                                                         \
 5923-	tota1 += decode[2 + (ofs) * 2] * c;                                        \
 5924-	totb1 += decode[3 + (ofs) * 2] * c;                                        \
 5925-	c = hc[2 + (ofs)];                                                         \
 5926-	tota2 += decode[4 + (ofs) * 2] * c;                                        \
 5927-	totb2 += decode[5 + (ofs) * 2] * c;
 5928-
 5929-#define stbir__store_output()                                                  \
 5930-	output[0] = (tota0 + tota2) + (tota1 + tota3);                             \
 5931-	output[1] = (totb0 + totb2) + (totb1 + totb3);                             \
 5932-	horizontal_coefficients += coefficient_width;                              \
 5933-	++horizontal_contributors;                                                 \
 5934-	output += 2;
 5935-
 5936-#endif
 5937-
 5938-#define STBIR__horizontal_channels 2
 5939-#define STB_IMAGE_RESIZE_DO_HORIZONTALS
 5940-#include STBIR__HEADER_FILENAME
 5941-
 5942-//=================
 5943-// Do 3 channel horizontal routines
 5944-
 5945-#ifdef STBIR_SIMD
 5946-
 5947-#define stbir__1_coeff_only()                                                  \
 5948-	stbir__simdf tot, c, d;                                                    \
 5949-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 5950-	stbir__simdf_load1z(c, hc);                                                \
 5951-	stbir__simdf_0123to0001(c, c);                                             \
 5952-	stbir__simdf_load(d, decode);                                              \
 5953-	stbir__simdf_mult(tot, d, c);
 5954-
 5955-#define stbir__2_coeff_only()                                                  \
 5956-	stbir__simdf tot, c, cs, d;                                                \
 5957-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 5958-	stbir__simdf_load2(cs, hc);                                                \
 5959-	stbir__simdf_0123to0000(c, cs);                                            \
 5960-	stbir__simdf_load(d, decode);                                              \
 5961-	stbir__simdf_mult(tot, d, c);                                              \
 5962-	stbir__simdf_0123to1111(c, cs);                                            \
 5963-	stbir__simdf_load(d, decode + 3);                                          \
 5964-	stbir__simdf_madd(tot, tot, d, c);
 5965-
 5966-#define stbir__3_coeff_only()                                                  \
 5967-	stbir__simdf tot, c, d, cs;                                                \
 5968-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 5969-	stbir__simdf_load(cs, hc);                                                 \
 5970-	stbir__simdf_0123to0000(c, cs);                                            \
 5971-	stbir__simdf_load(d, decode);                                              \
 5972-	stbir__simdf_mult(tot, d, c);                                              \
 5973-	stbir__simdf_0123to1111(c, cs);                                            \
 5974-	stbir__simdf_load(d, decode + 3);                                          \
 5975-	stbir__simdf_madd(tot, tot, d, c);                                         \
 5976-	stbir__simdf_0123to2222(c, cs);                                            \
 5977-	stbir__simdf_load(d, decode + 6);                                          \
 5978-	stbir__simdf_madd(tot, tot, d, c);
 5979-
 5980-#define stbir__store_output_tiny()                                             \
 5981-	stbir__simdf_store2(output, tot);                                          \
 5982-	stbir__simdf_0123to2301(tot, tot);                                         \
 5983-	stbir__simdf_store1(output + 2, tot);                                      \
 5984-	horizontal_coefficients += coefficient_width;                              \
 5985-	++horizontal_contributors;                                                 \
 5986-	output += 3;
 5987-
 5988-#ifdef STBIR_SIMD8
 5989-
 5990-// we load from decode offset by -1 so that the XXX and YYY pixels land in
 5991-// different halves of the AVX register
 5992-#define stbir__4_coeff_start()                                                 \
 5993-	stbir__simdf8 tot0, tot1, c, cs;                                           \
 5994-	stbir__simdf t;                                                            \
 5995-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 5996-	stbir__simdf8_load4b(cs, hc);                                              \
 5997-	stbir__simdf8_0123to00001111(c, cs);                                       \
 5998-	stbir__simdf8_mult_mem(tot0, c, decode - 1);                               \
 5999-	stbir__simdf8_0123to22223333(c, cs);                                       \
 6000-	stbir__simdf8_mult_mem(tot1, c, decode + 6 - 1);
 6001-
 6002-#define stbir__4_coeff_continue_from_4(ofs)                                    \
 6003-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6004-	stbir__simdf8_load4b(cs, hc + (ofs));                                      \
 6005-	stbir__simdf8_0123to00001111(c, cs);                                       \
 6006-	stbir__simdf8_madd_mem(tot0, tot0, c, decode + (ofs) * 3 - 1);             \
 6007-	stbir__simdf8_0123to22223333(c, cs);                                       \
 6008-	stbir__simdf8_madd_mem(tot1, tot1, c, decode + (ofs) * 3 + 6 - 1);
 6009-
 6010-#define stbir__1_coeff_remnant(ofs)                                            \
 6011-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6012-	stbir__simdf_load1rep4(t, hc + (ofs));                                     \
 6013-	stbir__simdf8_madd_mem4(tot0, tot0, t, decode + (ofs) * 3 - 1);
 6014-
 6015-#define stbir__2_coeff_remnant(ofs)                                            \
 6016-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6017-	stbir__simdf8_load4b(cs, hc + (ofs) - 2);                                  \
 6018-	stbir__simdf8_0123to22223333(c, cs);                                       \
 6019-	stbir__simdf8_madd_mem(tot0, tot0, c, decode + (ofs) * 3 - 1);
 6020-
 6021-#define stbir__3_coeff_remnant(ofs)                                            \
 6022-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6023-	stbir__simdf8_load4b(cs, hc + (ofs));                                      \
 6024-	stbir__simdf8_0123to00001111(c, cs);                                       \
 6025-	stbir__simdf8_madd_mem(tot0, tot0, c, decode + (ofs) * 3 - 1);             \
 6026-	stbir__simdf8_0123to2222(t, cs);                                           \
 6027-	stbir__simdf8_madd_mem4(tot1, tot1, t, decode + (ofs) * 3 + 6 - 1);
 6028-
 6029-#define stbir__store_output()                                                  \
 6030-	stbir__simdf8_add(tot0, tot0, tot1);                                       \
 6031-	stbir__simdf_0123to1230(t, stbir__if_simdf8_cast_to_simdf4(tot0));         \
 6032-	stbir__simdf8_add4halves(t, t, tot0);                                      \
 6033-	horizontal_coefficients += coefficient_width;                              \
 6034-	++horizontal_contributors;                                                 \
 6035-	output += 3;                                                               \
 6036-	if (output < output_end) {                                                 \
 6037-		stbir__simdf_store(output - 3, t);                                     \
 6038-		continue;                                                              \
 6039-	}                                                                          \
 6040-	{                                                                          \
 6041-		stbir__simdf tt;                                                       \
 6042-		stbir__simdf_0123to2301(tt, t);                                        \
 6043-		stbir__simdf_store2(output - 3, t);                                    \
 6044-		stbir__simdf_store1(output + 2 - 3, tt);                               \
 6045-	}                                                                          \
 6046-	break;
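The 8-wide 3-channel macros above read from `decode - 1` so that two packed RGB pixels fall into separate 128-bit halves of one register. The scalar model below is only a sketch of that lane layout, not code from stb: `rgb4_offset_load` is a hypothetical name, plain arrays stand in for the AVX registers, and it assumes `decode[-1]` and `decode[12]` are readable padding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar model of the 8-lane, 3-channel, 4-coefficient path.
   decode points at 4 packed RGB pixels (12 floats); decode[-1] and
   decode[12] must be readable, their values are never used in the result.
   hc holds the 4 filter coefficients. */
static void rgb4_offset_load(const float *decode, const float *hc, float out[3])
{
	float tot0[8], tot1[8], sum[8];
	int i;

	assert(decode != NULL && hc != NULL && out != NULL);

	for (i = 0; i < 8; ++i) {
		/* lanes 0..3 use the first coefficient of each pair,
		   lanes 4..7 use the second (the 00001111 / 22223333 shuffles) */
		tot0[i] = decode[i - 1] * hc[i < 4 ? 0 : 1];
		tot1[i] = decode[i + 6 - 1] * hc[i < 4 ? 2 : 3];
		sum[i] = tot0[i] + tot1[i];
	}

	/* low half carries RGB in lanes 1..3, high half carries RGB in
	   lanes 4..6; lanes 0 and 7 are junk and are dropped here */
	for (i = 0; i < 3; ++i)
		out[i] = sum[1 + i] + sum[4 + i];
}
```

Called with `decode` pointing one float into a padded buffer, this should produce the same weighted sum as the plain scalar `stbir__4_coeff_start()` followed by `stbir__store_output()`, because the junk in lanes 0 and 7 never reaches `out`.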
 6047-
 6048-#else
 6049-
 6050-#define stbir__4_coeff_start()                                                 \
 6051-	stbir__simdf tot0, tot1, tot2, c, cs;                                      \
 6052-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6053-	stbir__simdf_load(cs, hc);                                                 \
 6054-	stbir__simdf_0123to0001(c, cs);                                            \
 6055-	stbir__simdf_mult_mem(tot0, c, decode);                                    \
 6056-	stbir__simdf_0123to1122(c, cs);                                            \
 6057-	stbir__simdf_mult_mem(tot1, c, decode + 4);                                \
 6058-	stbir__simdf_0123to2333(c, cs);                                            \
 6059-	stbir__simdf_mult_mem(tot2, c, decode + 8);
 6060-
 6061-#define stbir__4_coeff_continue_from_4(ofs)                                    \
 6062-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6063-	stbir__simdf_load(cs, hc + (ofs));                                         \
 6064-	stbir__simdf_0123to0001(c, cs);                                            \
 6065-	stbir__simdf_madd_mem(tot0, tot0, c, decode + (ofs) * 3);                  \
 6066-	stbir__simdf_0123to1122(c, cs);                                            \
 6067-	stbir__simdf_madd_mem(tot1, tot1, c, decode + (ofs) * 3 + 4);              \
 6068-	stbir__simdf_0123to2333(c, cs);                                            \
 6069-	stbir__simdf_madd_mem(tot2, tot2, c, decode + (ofs) * 3 + 8);
 6070-
 6071-#define stbir__1_coeff_remnant(ofs)                                            \
 6072-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6073-	stbir__simdf_load1z(c, hc + (ofs));                                        \
 6074-	stbir__simdf_0123to0001(c, c);                                             \
 6075-	stbir__simdf_madd_mem(tot0, tot0, c, decode + (ofs) * 3);
 6076-
 6077-#define stbir__2_coeff_remnant(ofs)                                            \
 6078-	{                                                                          \
 6079-		stbir__simdf d;                                                        \
 6080-		STBIR_SIMD_NO_UNROLL(decode);                                          \
 6081-		stbir__simdf_load2z(cs, hc + (ofs));                                   \
 6082-		stbir__simdf_0123to0001(c, cs);                                        \
 6083-		stbir__simdf_madd_mem(tot0, tot0, c, decode + (ofs) * 3);              \
 6084-		stbir__simdf_0123to1122(c, cs);                                        \
 6085-		stbir__simdf_load2z(d, decode + (ofs) * 3 + 4);                        \
 6086-		stbir__simdf_madd(tot1, tot1, c, d);                                   \
 6087-	}
 6088-
 6089-#define stbir__3_coeff_remnant(ofs)                                            \
 6090-	{                                                                          \
 6091-		stbir__simdf d;                                                        \
 6092-		STBIR_SIMD_NO_UNROLL(decode);                                          \
 6093-		stbir__simdf_load(cs, hc + (ofs));                                     \
 6094-		stbir__simdf_0123to0001(c, cs);                                        \
 6095-		stbir__simdf_madd_mem(tot0, tot0, c, decode + (ofs) * 3);              \
 6096-		stbir__simdf_0123to1122(c, cs);                                        \
 6097-		stbir__simdf_madd_mem(tot1, tot1, c, decode + (ofs) * 3 + 4);          \
 6098-		stbir__simdf_0123to2222(c, cs);                                        \
 6099-		stbir__simdf_load1z(d, decode + (ofs) * 3 + 8);                        \
 6100-		stbir__simdf_madd(tot2, tot2, c, d);                                   \
 6101-	}
 6102-
 6103-#define stbir__store_output()                                                  \
 6104-	stbir__simdf_0123ABCDto3ABx(c, tot0, tot1);                                \
 6105-	stbir__simdf_0123ABCDto23Ax(cs, tot1, tot2);                               \
 6106-	stbir__simdf_0123to1230(tot2, tot2);                                       \
 6107-	stbir__simdf_add(tot0, tot0, cs);                                          \
 6108-	stbir__simdf_add(c, c, tot2);                                              \
 6109-	stbir__simdf_add(tot0, tot0, c);                                           \
 6110-	horizontal_coefficients += coefficient_width;                              \
 6111-	++horizontal_contributors;                                                 \
 6112-	output += 3;                                                               \
 6113-	if (output < output_end) {                                                 \
 6114-		stbir__simdf_store(output - 3, tot0);                                  \
 6115-		continue;                                                              \
 6116-	}                                                                          \
 6117-	stbir__simdf_0123to2301(tot1, tot0);                                       \
 6118-	stbir__simdf_store2(output - 3, tot0);                                     \
 6119-	stbir__simdf_store1(output + 2 - 3, tot1);                                 \
 6120-	break;
 6121-
 6122-#endif
 6123-
 6124-#else
 6125-
 6126-#define stbir__1_coeff_only()                                                  \
 6127-	float tot0, tot1, tot2, c;                                                 \
 6128-	c = hc[0];                                                                 \
 6129-	tot0 = decode[0] * c;                                                      \
 6130-	tot1 = decode[1] * c;                                                      \
 6131-	tot2 = decode[2] * c;
 6132-
 6133-#define stbir__2_coeff_only()                                                  \
 6134-	float tot0, tot1, tot2, c;                                                 \
 6135-	c = hc[0];                                                                 \
 6136-	tot0 = decode[0] * c;                                                      \
 6137-	tot1 = decode[1] * c;                                                      \
 6138-	tot2 = decode[2] * c;                                                      \
 6139-	c = hc[1];                                                                 \
 6140-	tot0 += decode[3] * c;                                                     \
 6141-	tot1 += decode[4] * c;                                                     \
 6142-	tot2 += decode[5] * c;
 6143-
 6144-#define stbir__3_coeff_only()                                                  \
 6145-	float tot0, tot1, tot2, c;                                                 \
 6146-	c = hc[0];                                                                 \
 6147-	tot0 = decode[0] * c;                                                      \
 6148-	tot1 = decode[1] * c;                                                      \
 6149-	tot2 = decode[2] * c;                                                      \
 6150-	c = hc[1];                                                                 \
 6151-	tot0 += decode[3] * c;                                                     \
 6152-	tot1 += decode[4] * c;                                                     \
 6153-	tot2 += decode[5] * c;                                                     \
 6154-	c = hc[2];                                                                 \
 6155-	tot0 += decode[6] * c;                                                     \
 6156-	tot1 += decode[7] * c;                                                     \
 6157-	tot2 += decode[8] * c;
 6158-
 6159-#define stbir__store_output_tiny()                                             \
 6160-	output[0] = tot0;                                                          \
 6161-	output[1] = tot1;                                                          \
 6162-	output[2] = tot2;                                                          \
 6163-	horizontal_coefficients += coefficient_width;                              \
 6164-	++horizontal_contributors;                                                 \
 6165-	output += 3;
 6166-
 6167-#define stbir__4_coeff_start()                                                 \
 6168-	float tota0, tota1, tota2, totb0, totb1, totb2, totc0, totc1, totc2,       \
 6169-	    totd0, totd1, totd2, c;                                                \
 6170-	c = hc[0];                                                                 \
 6171-	tota0 = decode[0] * c;                                                     \
 6172-	tota1 = decode[1] * c;                                                     \
 6173-	tota2 = decode[2] * c;                                                     \
 6174-	c = hc[1];                                                                 \
 6175-	totb0 = decode[3] * c;                                                     \
 6176-	totb1 = decode[4] * c;                                                     \
 6177-	totb2 = decode[5] * c;                                                     \
 6178-	c = hc[2];                                                                 \
 6179-	totc0 = decode[6] * c;                                                     \
 6180-	totc1 = decode[7] * c;                                                     \
 6181-	totc2 = decode[8] * c;                                                     \
 6182-	c = hc[3];                                                                 \
 6183-	totd0 = decode[9] * c;                                                     \
 6184-	totd1 = decode[10] * c;                                                    \
 6185-	totd2 = decode[11] * c;
 6186-
 6187-#define stbir__4_coeff_continue_from_4(ofs)                                    \
 6188-	c = hc[0 + (ofs)];                                                         \
 6189-	tota0 += decode[0 + (ofs) * 3] * c;                                        \
 6190-	tota1 += decode[1 + (ofs) * 3] * c;                                        \
 6191-	tota2 += decode[2 + (ofs) * 3] * c;                                        \
 6192-	c = hc[1 + (ofs)];                                                         \
 6193-	totb0 += decode[3 + (ofs) * 3] * c;                                        \
 6194-	totb1 += decode[4 + (ofs) * 3] * c;                                        \
 6195-	totb2 += decode[5 + (ofs) * 3] * c;                                        \
 6196-	c = hc[2 + (ofs)];                                                         \
 6197-	totc0 += decode[6 + (ofs) * 3] * c;                                        \
 6198-	totc1 += decode[7 + (ofs) * 3] * c;                                        \
 6199-	totc2 += decode[8 + (ofs) * 3] * c;                                        \
 6200-	c = hc[3 + (ofs)];                                                         \
 6201-	totd0 += decode[9 + (ofs) * 3] * c;                                        \
 6202-	totd1 += decode[10 + (ofs) * 3] * c;                                       \
 6203-	totd2 += decode[11 + (ofs) * 3] * c;
 6204-
 6205-#define stbir__1_coeff_remnant(ofs)                                            \
 6206-	c = hc[0 + (ofs)];                                                         \
 6207-	tota0 += decode[0 + (ofs) * 3] * c;                                        \
 6208-	tota1 += decode[1 + (ofs) * 3] * c;                                        \
 6209-	tota2 += decode[2 + (ofs) * 3] * c;
 6210-
 6211-#define stbir__2_coeff_remnant(ofs)                                            \
 6212-	c = hc[0 + (ofs)];                                                         \
 6213-	tota0 += decode[0 + (ofs) * 3] * c;                                        \
 6214-	tota1 += decode[1 + (ofs) * 3] * c;                                        \
 6215-	tota2 += decode[2 + (ofs) * 3] * c;                                        \
 6216-	c = hc[1 + (ofs)];                                                         \
 6217-	totb0 += decode[3 + (ofs) * 3] * c;                                        \
 6218-	totb1 += decode[4 + (ofs) * 3] * c;                                        \
 6219-	totb2 += decode[5 + (ofs) * 3] * c;
 6220-
 6221-#define stbir__3_coeff_remnant(ofs)                                            \
 6222-	c = hc[0 + (ofs)];                                                         \
 6223-	tota0 += decode[0 + (ofs) * 3] * c;                                        \
 6224-	tota1 += decode[1 + (ofs) * 3] * c;                                        \
 6225-	tota2 += decode[2 + (ofs) * 3] * c;                                        \
 6226-	c = hc[1 + (ofs)];                                                         \
 6227-	totb0 += decode[3 + (ofs) * 3] * c;                                        \
 6228-	totb1 += decode[4 + (ofs) * 3] * c;                                        \
 6229-	totb2 += decode[5 + (ofs) * 3] * c;                                        \
 6230-	c = hc[2 + (ofs)];                                                         \
 6231-	totc0 += decode[6 + (ofs) * 3] * c;                                        \
 6232-	totc1 += decode[7 + (ofs) * 3] * c;                                        \
 6233-	totc2 += decode[8 + (ofs) * 3] * c;
 6234-
 6235-#define stbir__store_output()                                                  \
 6236-	output[0] = (tota0 + totc0) + (totb0 + totd0);                             \
 6237-	output[1] = (tota1 + totc1) + (totb1 + totd1);                             \
 6238-	output[2] = (tota2 + totc2) + (totb2 + totd2);                             \
 6239-	horizontal_coefficients += coefficient_width;                              \
 6240-	++horizontal_contributors;                                                 \
 6241-	output += 3;
 6242-
 6243-#endif
 6244-
 6245-#define STBIR__horizontal_channels 3
 6246-#define STB_IMAGE_RESIZE_DO_HORIZONTALS
 6247-#include STBIR__HEADER_FILENAME
 6248-
 6249-//=================
 6250-// Do 4 channel horizontal routines
 6251-
 6252-#ifdef STBIR_SIMD
 6253-
 6254-#define stbir__1_coeff_only()                                                  \
 6255-	stbir__simdf tot, c;                                                       \
 6256-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6257-	stbir__simdf_load1(c, hc);                                                 \
 6258-	stbir__simdf_0123to0000(c, c);                                             \
 6259-	stbir__simdf_mult_mem(tot, c, decode);
 6260-
 6261-#define stbir__2_coeff_only()                                                  \
 6262-	stbir__simdf tot, c, cs;                                                   \
 6263-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6264-	stbir__simdf_load2(cs, hc);                                                \
 6265-	stbir__simdf_0123to0000(c, cs);                                            \
 6266-	stbir__simdf_mult_mem(tot, c, decode);                                     \
 6267-	stbir__simdf_0123to1111(c, cs);                                            \
 6268-	stbir__simdf_madd_mem(tot, tot, c, decode + 4);
 6269-
 6270-#define stbir__3_coeff_only()                                                  \
 6271-	stbir__simdf tot, c, cs;                                                   \
 6272-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6273-	stbir__simdf_load(cs, hc);                                                 \
 6274-	stbir__simdf_0123to0000(c, cs);                                            \
 6275-	stbir__simdf_mult_mem(tot, c, decode);                                     \
 6276-	stbir__simdf_0123to1111(c, cs);                                            \
 6277-	stbir__simdf_madd_mem(tot, tot, c, decode + 4);                            \
 6278-	stbir__simdf_0123to2222(c, cs);                                            \
 6279-	stbir__simdf_madd_mem(tot, tot, c, decode + 8);
 6280-
 6281-#define stbir__store_output_tiny()                                             \
 6282-	stbir__simdf_store(output, tot);                                           \
 6283-	horizontal_coefficients += coefficient_width;                              \
 6284-	++horizontal_contributors;                                                 \
 6285-	output += 4;
 6286-
 6287-#ifdef STBIR_SIMD8
 6288-
 6289-#define stbir__4_coeff_start()                                                 \
 6290-	stbir__simdf8 tot0, c, cs;                                                 \
 6291-	stbir__simdf t;                                                            \
 6292-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6293-	stbir__simdf8_load4b(cs, hc);                                              \
 6294-	stbir__simdf8_0123to00001111(c, cs);                                       \
 6295-	stbir__simdf8_mult_mem(tot0, c, decode);                                   \
 6296-	stbir__simdf8_0123to22223333(c, cs);                                       \
 6297-	stbir__simdf8_madd_mem(tot0, tot0, c, decode + 8);
 6298-
 6299-#define stbir__4_coeff_continue_from_4(ofs)                                    \
 6300-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6301-	stbir__simdf8_load4b(cs, hc + (ofs));                                      \
 6302-	stbir__simdf8_0123to00001111(c, cs);                                       \
 6303-	stbir__simdf8_madd_mem(tot0, tot0, c, decode + (ofs) * 4);                 \
 6304-	stbir__simdf8_0123to22223333(c, cs);                                       \
 6305-	stbir__simdf8_madd_mem(tot0, tot0, c, decode + (ofs) * 4 + 8);
 6306-
 6307-#define stbir__1_coeff_remnant(ofs)                                            \
 6308-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6309-	stbir__simdf_load1rep4(t, hc + (ofs));                                     \
 6310-	stbir__simdf8_madd_mem4(tot0, tot0, t, decode + (ofs) * 4);
 6311-
 6312-#define stbir__2_coeff_remnant(ofs)                                            \
 6313-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6314-	stbir__simdf8_load4b(cs, hc + (ofs) - 2);                                  \
 6315-	stbir__simdf8_0123to22223333(c, cs);                                       \
 6316-	stbir__simdf8_madd_mem(tot0, tot0, c, decode + (ofs) * 4);
 6317-
 6318-#define stbir__3_coeff_remnant(ofs)                                            \
 6319-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6320-	stbir__simdf8_load4b(cs, hc + (ofs));                                      \
 6321-	stbir__simdf8_0123to00001111(c, cs);                                       \
 6322-	stbir__simdf8_madd_mem(tot0, tot0, c, decode + (ofs) * 4);                 \
 6323-	stbir__simdf8_0123to2222(t, cs);                                           \
 6324-	stbir__simdf8_madd_mem4(tot0, tot0, t, decode + (ofs) * 4 + 8);
 6325-
 6326-#define stbir__store_output()                                                  \
 6327-	stbir__simdf8_add4halves(t, stbir__if_simdf8_cast_to_simdf4(tot0), tot0);  \
 6328-	stbir__simdf_store(output, t);                                             \
 6329-	horizontal_coefficients += coefficient_width;                              \
 6330-	++horizontal_contributors;                                                 \
 6331-	output += 4;
 6332-
 6333-#else
 6334-
 6335-#define stbir__4_coeff_start()                                                 \
 6336-	stbir__simdf tot0, tot1, c, cs;                                            \
 6337-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6338-	stbir__simdf_load(cs, hc);                                                 \
 6339-	stbir__simdf_0123to0000(c, cs);                                            \
 6340-	stbir__simdf_mult_mem(tot0, c, decode);                                    \
 6341-	stbir__simdf_0123to1111(c, cs);                                            \
 6342-	stbir__simdf_mult_mem(tot1, c, decode + 4);                                \
 6343-	stbir__simdf_0123to2222(c, cs);                                            \
 6344-	stbir__simdf_madd_mem(tot0, tot0, c, decode + 8);                          \
 6345-	stbir__simdf_0123to3333(c, cs);                                            \
 6346-	stbir__simdf_madd_mem(tot1, tot1, c, decode + 12);
 6347-
 6348-#define stbir__4_coeff_continue_from_4(ofs)                                    \
 6349-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6350-	stbir__simdf_load(cs, hc + (ofs));                                         \
 6351-	stbir__simdf_0123to0000(c, cs);                                            \
 6352-	stbir__simdf_madd_mem(tot0, tot0, c, decode + (ofs) * 4);                  \
 6353-	stbir__simdf_0123to1111(c, cs);                                            \
 6354-	stbir__simdf_madd_mem(tot1, tot1, c, decode + (ofs) * 4 + 4);              \
 6355-	stbir__simdf_0123to2222(c, cs);                                            \
 6356-	stbir__simdf_madd_mem(tot0, tot0, c, decode + (ofs) * 4 + 8);              \
 6357-	stbir__simdf_0123to3333(c, cs);                                            \
 6358-	stbir__simdf_madd_mem(tot1, tot1, c, decode + (ofs) * 4 + 12);
 6359-
 6360-#define stbir__1_coeff_remnant(ofs)                                            \
 6361-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6362-	stbir__simdf_load1(c, hc + (ofs));                                         \
 6363-	stbir__simdf_0123to0000(c, c);                                             \
 6364-	stbir__simdf_madd_mem(tot0, tot0, c, decode + (ofs) * 4);
 6365-
 6366-#define stbir__2_coeff_remnant(ofs)                                            \
 6367-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6368-	stbir__simdf_load2(cs, hc + (ofs));                                        \
 6369-	stbir__simdf_0123to0000(c, cs);                                            \
 6370-	stbir__simdf_madd_mem(tot0, tot0, c, decode + (ofs) * 4);                  \
 6371-	stbir__simdf_0123to1111(c, cs);                                            \
 6372-	stbir__simdf_madd_mem(tot1, tot1, c, decode + (ofs) * 4 + 4);
 6373-
 6374-#define stbir__3_coeff_remnant(ofs)                                            \
 6375-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6376-	stbir__simdf_load(cs, hc + (ofs));                                         \
 6377-	stbir__simdf_0123to0000(c, cs);                                            \
 6378-	stbir__simdf_madd_mem(tot0, tot0, c, decode + (ofs) * 4);                  \
 6379-	stbir__simdf_0123to1111(c, cs);                                            \
 6380-	stbir__simdf_madd_mem(tot1, tot1, c, decode + (ofs) * 4 + 4);              \
 6381-	stbir__simdf_0123to2222(c, cs);                                            \
 6382-	stbir__simdf_madd_mem(tot0, tot0, c, decode + (ofs) * 4 + 8);
 6383-
 6384-#define stbir__store_output()                                                  \
 6385-	stbir__simdf_add(tot0, tot0, tot1);                                        \
 6386-	stbir__simdf_store(output, tot0);                                          \
 6387-	horizontal_coefficients += coefficient_width;                              \
 6388-	++horizontal_contributors;                                                 \
 6389-	output += 4;
 6390-
 6391-#endif
 6392-
 6393-#else
 6394-
 6395-#define stbir__1_coeff_only()                                                  \
 6396-	float p0, p1, p2, p3, c;                                                   \
 6397-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6398-	c = hc[0];                                                                 \
 6399-	p0 = decode[0] * c;                                                        \
 6400-	p1 = decode[1] * c;                                                        \
 6401-	p2 = decode[2] * c;                                                        \
 6402-	p3 = decode[3] * c;
 6403-
 6404-#define stbir__2_coeff_only()                                                  \
 6405-	float p0, p1, p2, p3, c;                                                   \
 6406-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6407-	c = hc[0];                                                                 \
 6408-	p0 = decode[0] * c;                                                        \
 6409-	p1 = decode[1] * c;                                                        \
 6410-	p2 = decode[2] * c;                                                        \
 6411-	p3 = decode[3] * c;                                                        \
 6412-	c = hc[1];                                                                 \
 6413-	p0 += decode[4] * c;                                                       \
 6414-	p1 += decode[5] * c;                                                       \
 6415-	p2 += decode[6] * c;                                                       \
 6416-	p3 += decode[7] * c;
 6417-
 6418-#define stbir__3_coeff_only()                                                  \
 6419-	float p0, p1, p2, p3, c;                                                   \
 6420-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6421-	c = hc[0];                                                                 \
 6422-	p0 = decode[0] * c;                                                        \
 6423-	p1 = decode[1] * c;                                                        \
 6424-	p2 = decode[2] * c;                                                        \
 6425-	p3 = decode[3] * c;                                                        \
 6426-	c = hc[1];                                                                 \
 6427-	p0 += decode[4] * c;                                                       \
 6428-	p1 += decode[5] * c;                                                       \
 6429-	p2 += decode[6] * c;                                                       \
 6430-	p3 += decode[7] * c;                                                       \
 6431-	c = hc[2];                                                                 \
 6432-	p0 += decode[8] * c;                                                       \
 6433-	p1 += decode[9] * c;                                                       \
 6434-	p2 += decode[10] * c;                                                      \
 6435-	p3 += decode[11] * c;
 6436-
 6437-#define stbir__store_output_tiny()                                             \
 6438-	output[0] = p0;                                                            \
 6439-	output[1] = p1;                                                            \
 6440-	output[2] = p2;                                                            \
 6441-	output[3] = p3;                                                            \
 6442-	horizontal_coefficients += coefficient_width;                              \
 6443-	++horizontal_contributors;                                                 \
 6444-	output += 4;
 6445-
 6446-#define stbir__4_coeff_start()                                                 \
 6447-	float x0, x1, x2, x3, y0, y1, y2, y3, c;                                   \
 6448-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6449-	c = hc[0];                                                                 \
 6450-	x0 = decode[0] * c;                                                        \
 6451-	x1 = decode[1] * c;                                                        \
 6452-	x2 = decode[2] * c;                                                        \
 6453-	x3 = decode[3] * c;                                                        \
 6454-	c = hc[1];                                                                 \
 6455-	y0 = decode[4] * c;                                                        \
 6456-	y1 = decode[5] * c;                                                        \
 6457-	y2 = decode[6] * c;                                                        \
 6458-	y3 = decode[7] * c;                                                        \
 6459-	c = hc[2];                                                                 \
 6460-	x0 += decode[8] * c;                                                       \
 6461-	x1 += decode[9] * c;                                                       \
 6462-	x2 += decode[10] * c;                                                      \
 6463-	x3 += decode[11] * c;                                                      \
 6464-	c = hc[3];                                                                 \
 6465-	y0 += decode[12] * c;                                                      \
 6466-	y1 += decode[13] * c;                                                      \
 6467-	y2 += decode[14] * c;                                                      \
 6468-	y3 += decode[15] * c;
 6469-
 6470-#define stbir__4_coeff_continue_from_4(ofs)                                    \
 6471-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6472-	c = hc[0 + (ofs)];                                                         \
 6473-	x0 += decode[0 + (ofs) * 4] * c;                                           \
 6474-	x1 += decode[1 + (ofs) * 4] * c;                                           \
 6475-	x2 += decode[2 + (ofs) * 4] * c;                                           \
 6476-	x3 += decode[3 + (ofs) * 4] * c;                                           \
 6477-	c = hc[1 + (ofs)];                                                         \
 6478-	y0 += decode[4 + (ofs) * 4] * c;                                           \
 6479-	y1 += decode[5 + (ofs) * 4] * c;                                           \
 6480-	y2 += decode[6 + (ofs) * 4] * c;                                           \
 6481-	y3 += decode[7 + (ofs) * 4] * c;                                           \
 6482-	c = hc[2 + (ofs)];                                                         \
 6483-	x0 += decode[8 + (ofs) * 4] * c;                                           \
 6484-	x1 += decode[9 + (ofs) * 4] * c;                                           \
 6485-	x2 += decode[10 + (ofs) * 4] * c;                                          \
 6486-	x3 += decode[11 + (ofs) * 4] * c;                                          \
 6487-	c = hc[3 + (ofs)];                                                         \
 6488-	y0 += decode[12 + (ofs) * 4] * c;                                          \
 6489-	y1 += decode[13 + (ofs) * 4] * c;                                          \
 6490-	y2 += decode[14 + (ofs) * 4] * c;                                          \
 6491-	y3 += decode[15 + (ofs) * 4] * c;
 6492-
 6493-#define stbir__1_coeff_remnant(ofs)                                            \
 6494-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6495-	c = hc[0 + (ofs)];                                                         \
 6496-	x0 += decode[0 + (ofs) * 4] * c;                                           \
 6497-	x1 += decode[1 + (ofs) * 4] * c;                                           \
 6498-	x2 += decode[2 + (ofs) * 4] * c;                                           \
 6499-	x3 += decode[3 + (ofs) * 4] * c;
 6500-
 6501-#define stbir__2_coeff_remnant(ofs)                                            \
 6502-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6503-	c = hc[0 + (ofs)];                                                         \
 6504-	x0 += decode[0 + (ofs) * 4] * c;                                           \
 6505-	x1 += decode[1 + (ofs) * 4] * c;                                           \
 6506-	x2 += decode[2 + (ofs) * 4] * c;                                           \
 6507-	x3 += decode[3 + (ofs) * 4] * c;                                           \
 6508-	c = hc[1 + (ofs)];                                                         \
 6509-	y0 += decode[4 + (ofs) * 4] * c;                                           \
 6510-	y1 += decode[5 + (ofs) * 4] * c;                                           \
 6511-	y2 += decode[6 + (ofs) * 4] * c;                                           \
 6512-	y3 += decode[7 + (ofs) * 4] * c;
 6513-
 6514-#define stbir__3_coeff_remnant(ofs)                                            \
 6515-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6516-	c = hc[0 + (ofs)];                                                         \
 6517-	x0 += decode[0 + (ofs) * 4] * c;                                           \
 6518-	x1 += decode[1 + (ofs) * 4] * c;                                           \
 6519-	x2 += decode[2 + (ofs) * 4] * c;                                           \
 6520-	x3 += decode[3 + (ofs) * 4] * c;                                           \
 6521-	c = hc[1 + (ofs)];                                                         \
 6522-	y0 += decode[4 + (ofs) * 4] * c;                                           \
 6523-	y1 += decode[5 + (ofs) * 4] * c;                                           \
 6524-	y2 += decode[6 + (ofs) * 4] * c;                                           \
 6525-	y3 += decode[7 + (ofs) * 4] * c;                                           \
 6526-	c = hc[2 + (ofs)];                                                         \
 6527-	x0 += decode[8 + (ofs) * 4] * c;                                           \
 6528-	x1 += decode[9 + (ofs) * 4] * c;                                           \
 6529-	x2 += decode[10 + (ofs) * 4] * c;                                          \
 6530-	x3 += decode[11 + (ofs) * 4] * c;
 6531-
 6532-#define stbir__store_output()                                                  \
 6533-	output[0] = x0 + y0;                                                       \
 6534-	output[1] = x1 + y1;                                                       \
 6535-	output[2] = x2 + y2;                                                       \
 6536-	output[3] = x3 + y3;                                                       \
 6537-	horizontal_coefficients += coefficient_width;                              \
 6538-	++horizontal_contributors;                                                 \
 6539-	output += 4;
 6540-
 6541-#endif
 6542-
 6543-#define STBIR__horizontal_channels 4
 6544-#define STB_IMAGE_RESIZE_DO_HORIZONTALS
 6545-#include STBIR__HEADER_FILENAME
 6546-
 6547-//=================
 6548-// Do 7 channel horizontal routines
 6549-
 6550-#ifdef STBIR_SIMD
 6551-
 6552-#define stbir__1_coeff_only()                                                  \
 6553-	stbir__simdf tot0, tot1, c;                                                \
 6554-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6555-	stbir__simdf_load1(c, hc);                                                 \
 6556-	stbir__simdf_0123to0000(c, c);                                             \
 6557-	stbir__simdf_mult_mem(tot0, c, decode);                                    \
 6558-	stbir__simdf_mult_mem(tot1, c, decode + 3);
 6559-
 6560-#define stbir__2_coeff_only()                                                  \
 6561-	stbir__simdf tot0, tot1, c, cs;                                            \
 6562-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6563-	stbir__simdf_load2(cs, hc);                                                \
 6564-	stbir__simdf_0123to0000(c, cs);                                            \
 6565-	stbir__simdf_mult_mem(tot0, c, decode);                                    \
 6566-	stbir__simdf_mult_mem(tot1, c, decode + 3);                                \
 6567-	stbir__simdf_0123to1111(c, cs);                                            \
 6568-	stbir__simdf_madd_mem(tot0, tot0, c, decode + 7);                          \
 6569-	stbir__simdf_madd_mem(tot1, tot1, c, decode + 10);
 6570-
 6571-#define stbir__3_coeff_only()                                                  \
 6572-	stbir__simdf tot0, tot1, c, cs;                                            \
 6573-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6574-	stbir__simdf_load(cs, hc);                                                 \
 6575-	stbir__simdf_0123to0000(c, cs);                                            \
 6576-	stbir__simdf_mult_mem(tot0, c, decode);                                    \
 6577-	stbir__simdf_mult_mem(tot1, c, decode + 3);                                \
 6578-	stbir__simdf_0123to1111(c, cs);                                            \
 6579-	stbir__simdf_madd_mem(tot0, tot0, c, decode + 7);                          \
 6580-	stbir__simdf_madd_mem(tot1, tot1, c, decode + 10);                         \
 6581-	stbir__simdf_0123to2222(c, cs);                                            \
 6582-	stbir__simdf_madd_mem(tot0, tot0, c, decode + 14);                         \
 6583-	stbir__simdf_madd_mem(tot1, tot1, c, decode + 17);
 6584-
 6585-#define stbir__store_output_tiny()                                             \
 6586-	stbir__simdf_store(output + 3, tot1);                                      \
 6587-	stbir__simdf_store(output, tot0);                                          \
 6588-	horizontal_coefficients += coefficient_width;                              \
 6589-	++horizontal_contributors;                                                 \
 6590-	output += 7;
 6591-
 6592-#ifdef STBIR_SIMD8
 6593-
 6594-#define stbir__4_coeff_start()                                                 \
 6595-	stbir__simdf8 tot0, tot1, c, cs;                                           \
 6596-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6597-	stbir__simdf8_load4b(cs, hc);                                              \
 6598-	stbir__simdf8_0123to00000000(c, cs);                                       \
 6599-	stbir__simdf8_mult_mem(tot0, c, decode);                                   \
 6600-	stbir__simdf8_0123to11111111(c, cs);                                       \
 6601-	stbir__simdf8_mult_mem(tot1, c, decode + 7);                               \
 6602-	stbir__simdf8_0123to22222222(c, cs);                                       \
 6603-	stbir__simdf8_madd_mem(tot0, tot0, c, decode + 14);                        \
 6604-	stbir__simdf8_0123to33333333(c, cs);                                       \
 6605-	stbir__simdf8_madd_mem(tot1, tot1, c, decode + 21);
 6606-
 6607-#define stbir__4_coeff_continue_from_4(ofs)                                    \
 6608-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6609-	stbir__simdf8_load4b(cs, hc + (ofs));                                      \
 6610-	stbir__simdf8_0123to00000000(c, cs);                                       \
 6611-	stbir__simdf8_madd_mem(tot0, tot0, c, decode + (ofs) * 7);                 \
 6612-	stbir__simdf8_0123to11111111(c, cs);                                       \
 6613-	stbir__simdf8_madd_mem(tot1, tot1, c, decode + (ofs) * 7 + 7);             \
 6614-	stbir__simdf8_0123to22222222(c, cs);                                       \
 6615-	stbir__simdf8_madd_mem(tot0, tot0, c, decode + (ofs) * 7 + 14);            \
 6616-	stbir__simdf8_0123to33333333(c, cs);                                       \
 6617-	stbir__simdf8_madd_mem(tot1, tot1, c, decode + (ofs) * 7 + 21);
 6618-
 6619-#define stbir__1_coeff_remnant(ofs)                                            \
 6620-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6621-	stbir__simdf8_load1b(c, hc + (ofs));                                       \
 6622-	stbir__simdf8_madd_mem(tot0, tot0, c, decode + (ofs) * 7);
 6623-
 6624-#define stbir__2_coeff_remnant(ofs)                                            \
 6625-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6626-	stbir__simdf8_load1b(c, hc + (ofs));                                       \
 6627-	stbir__simdf8_madd_mem(tot0, tot0, c, decode + (ofs) * 7);                 \
 6628-	stbir__simdf8_load1b(c, hc + (ofs) + 1);                                   \
 6629-	stbir__simdf8_madd_mem(tot1, tot1, c, decode + (ofs) * 7 + 7);
 6630-
 6631-#define stbir__3_coeff_remnant(ofs)                                            \
 6632-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6633-	stbir__simdf8_load4b(cs, hc + (ofs));                                      \
 6634-	stbir__simdf8_0123to00000000(c, cs);                                       \
 6635-	stbir__simdf8_madd_mem(tot0, tot0, c, decode + (ofs) * 7);                 \
 6636-	stbir__simdf8_0123to11111111(c, cs);                                       \
 6637-	stbir__simdf8_madd_mem(tot1, tot1, c, decode + (ofs) * 7 + 7);             \
 6638-	stbir__simdf8_0123to22222222(c, cs);                                       \
 6639-	stbir__simdf8_madd_mem(tot0, tot0, c, decode + (ofs) * 7 + 14);
 6640-
 6641-#define stbir__store_output()                                                  \
 6642-	stbir__simdf8_add(tot0, tot0, tot1);                                       \
 6643-	horizontal_coefficients += coefficient_width;                              \
 6644-	++horizontal_contributors;                                                 \
 6645-	output += 7;                                                               \
 6646-	if (output < output_end) {                                                 \
 6647-		stbir__simdf8_store(output - 7, tot0);                                 \
 6648-		continue;                                                              \
 6649-	}                                                                          \
 6650-	stbir__simdf_store(                                                        \
 6651-	    output - 7 + 3,                                                        \
 6652-	    stbir__simdf_swiz(stbir__simdf8_gettop4(tot0), 0, 0, 1, 2));           \
 6653-	stbir__simdf_store(output - 7, stbir__if_simdf8_cast_to_simdf4(tot0));     \
 6654-	break;
 6655-
 6656-#else
 6657-
 6658-#define stbir__4_coeff_start()                                                 \
 6659-	stbir__simdf tot0, tot1, tot2, tot3, c, cs;                                \
 6660-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6661-	stbir__simdf_load(cs, hc);                                                 \
 6662-	stbir__simdf_0123to0000(c, cs);                                            \
 6663-	stbir__simdf_mult_mem(tot0, c, decode);                                    \
 6664-	stbir__simdf_mult_mem(tot1, c, decode + 3);                                \
 6665-	stbir__simdf_0123to1111(c, cs);                                            \
 6666-	stbir__simdf_mult_mem(tot2, c, decode + 7);                                \
 6667-	stbir__simdf_mult_mem(tot3, c, decode + 10);                               \
 6668-	stbir__simdf_0123to2222(c, cs);                                            \
 6669-	stbir__simdf_madd_mem(tot0, tot0, c, decode + 14);                         \
 6670-	stbir__simdf_madd_mem(tot1, tot1, c, decode + 17);                         \
 6671-	stbir__simdf_0123to3333(c, cs);                                            \
 6672-	stbir__simdf_madd_mem(tot2, tot2, c, decode + 21);                         \
 6673-	stbir__simdf_madd_mem(tot3, tot3, c, decode + 24);
 6674-
 6675-#define stbir__4_coeff_continue_from_4(ofs)                                    \
 6676-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6677-	stbir__simdf_load(cs, hc + (ofs));                                         \
 6678-	stbir__simdf_0123to0000(c, cs);                                            \
 6679-	stbir__simdf_madd_mem(tot0, tot0, c, decode + (ofs) * 7);                  \
 6680-	stbir__simdf_madd_mem(tot1, tot1, c, decode + (ofs) * 7 + 3);              \
 6681-	stbir__simdf_0123to1111(c, cs);                                            \
 6682-	stbir__simdf_madd_mem(tot2, tot2, c, decode + (ofs) * 7 + 7);              \
 6683-	stbir__simdf_madd_mem(tot3, tot3, c, decode + (ofs) * 7 + 10);             \
 6684-	stbir__simdf_0123to2222(c, cs);                                            \
 6685-	stbir__simdf_madd_mem(tot0, tot0, c, decode + (ofs) * 7 + 14);             \
 6686-	stbir__simdf_madd_mem(tot1, tot1, c, decode + (ofs) * 7 + 17);             \
 6687-	stbir__simdf_0123to3333(c, cs);                                            \
 6688-	stbir__simdf_madd_mem(tot2, tot2, c, decode + (ofs) * 7 + 21);             \
 6689-	stbir__simdf_madd_mem(tot3, tot3, c, decode + (ofs) * 7 + 24);
 6690-
 6691-#define stbir__1_coeff_remnant(ofs)                                            \
 6692-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6693-	stbir__simdf_load1(c, hc + (ofs));                                         \
 6694-	stbir__simdf_0123to0000(c, c);                                             \
 6695-	stbir__simdf_madd_mem(tot0, tot0, c, decode + (ofs) * 7);                  \
 6696-	stbir__simdf_madd_mem(tot1, tot1, c, decode + (ofs) * 7 + 3);
 6697-
 6698-#define stbir__2_coeff_remnant(ofs)                                            \
 6699-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6700-	stbir__simdf_load2(cs, hc + (ofs));                                        \
 6701-	stbir__simdf_0123to0000(c, cs);                                            \
 6702-	stbir__simdf_madd_mem(tot0, tot0, c, decode + (ofs) * 7);                  \
 6703-	stbir__simdf_madd_mem(tot1, tot1, c, decode + (ofs) * 7 + 3);              \
 6704-	stbir__simdf_0123to1111(c, cs);                                            \
 6705-	stbir__simdf_madd_mem(tot2, tot2, c, decode + (ofs) * 7 + 7);              \
 6706-	stbir__simdf_madd_mem(tot3, tot3, c, decode + (ofs) * 7 + 10);
 6707-
 6708-#define stbir__3_coeff_remnant(ofs)                                            \
 6709-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6710-	stbir__simdf_load(cs, hc + (ofs));                                         \
 6711-	stbir__simdf_0123to0000(c, cs);                                            \
 6712-	stbir__simdf_madd_mem(tot0, tot0, c, decode + (ofs) * 7);                  \
 6713-	stbir__simdf_madd_mem(tot1, tot1, c, decode + (ofs) * 7 + 3);              \
 6714-	stbir__simdf_0123to1111(c, cs);                                            \
 6715-	stbir__simdf_madd_mem(tot2, tot2, c, decode + (ofs) * 7 + 7);              \
 6716-	stbir__simdf_madd_mem(tot3, tot3, c, decode + (ofs) * 7 + 10);             \
 6717-	stbir__simdf_0123to2222(c, cs);                                            \
 6718-	stbir__simdf_madd_mem(tot0, tot0, c, decode + (ofs) * 7 + 14);             \
 6719-	stbir__simdf_madd_mem(tot1, tot1, c, decode + (ofs) * 7 + 17);
 6720-
 6721-#define stbir__store_output()                                                  \
 6722-	stbir__simdf_add(tot0, tot0, tot2);                                        \
 6723-	stbir__simdf_add(tot1, tot1, tot3);                                        \
 6724-	stbir__simdf_store(output + 3, tot1);                                      \
 6725-	stbir__simdf_store(output, tot0);                                          \
 6726-	horizontal_coefficients += coefficient_width;                              \
 6727-	++horizontal_contributors;                                                 \
 6728-	output += 7;
 6729-
 6730-#endif
 6731-
 6732-#else
 6733-
 6734-#define stbir__1_coeff_only()                                                  \
 6735-	float tot0, tot1, tot2, tot3, tot4, tot5, tot6, c;                         \
 6736-	c = hc[0];                                                                 \
 6737-	tot0 = decode[0] * c;                                                      \
 6738-	tot1 = decode[1] * c;                                                      \
 6739-	tot2 = decode[2] * c;                                                      \
 6740-	tot3 = decode[3] * c;                                                      \
 6741-	tot4 = decode[4] * c;                                                      \
 6742-	tot5 = decode[5] * c;                                                      \
 6743-	tot6 = decode[6] * c;
 6744-
 6745-#define stbir__2_coeff_only()                                                  \
 6746-	float tot0, tot1, tot2, tot3, tot4, tot5, tot6, c;                         \
 6747-	c = hc[0];                                                                 \
 6748-	tot0 = decode[0] * c;                                                      \
 6749-	tot1 = decode[1] * c;                                                      \
 6750-	tot2 = decode[2] * c;                                                      \
 6751-	tot3 = decode[3] * c;                                                      \
 6752-	tot4 = decode[4] * c;                                                      \
 6753-	tot5 = decode[5] * c;                                                      \
 6754-	tot6 = decode[6] * c;                                                      \
 6755-	c = hc[1];                                                                 \
 6756-	tot0 += decode[7] * c;                                                     \
 6757-	tot1 += decode[8] * c;                                                     \
 6758-	tot2 += decode[9] * c;                                                     \
 6759-	tot3 += decode[10] * c;                                                    \
 6760-	tot4 += decode[11] * c;                                                    \
 6761-	tot5 += decode[12] * c;                                                    \
 6762-	tot6 += decode[13] * c;
 6763-
 6764-#define stbir__3_coeff_only()                                                  \
 6765-	float tot0, tot1, tot2, tot3, tot4, tot5, tot6, c;                         \
 6766-	c = hc[0];                                                                 \
 6767-	tot0 = decode[0] * c;                                                      \
 6768-	tot1 = decode[1] * c;                                                      \
 6769-	tot2 = decode[2] * c;                                                      \
 6770-	tot3 = decode[3] * c;                                                      \
 6771-	tot4 = decode[4] * c;                                                      \
 6772-	tot5 = decode[5] * c;                                                      \
 6773-	tot6 = decode[6] * c;                                                      \
 6774-	c = hc[1];                                                                 \
 6775-	tot0 += decode[7] * c;                                                     \
 6776-	tot1 += decode[8] * c;                                                     \
 6777-	tot2 += decode[9] * c;                                                     \
 6778-	tot3 += decode[10] * c;                                                    \
 6779-	tot4 += decode[11] * c;                                                    \
 6780-	tot5 += decode[12] * c;                                                    \
 6781-	tot6 += decode[13] * c;                                                    \
 6782-	c = hc[2];                                                                 \
 6783-	tot0 += decode[14] * c;                                                    \
 6784-	tot1 += decode[15] * c;                                                    \
 6785-	tot2 += decode[16] * c;                                                    \
 6786-	tot3 += decode[17] * c;                                                    \
 6787-	tot4 += decode[18] * c;                                                    \
 6788-	tot5 += decode[19] * c;                                                    \
 6789-	tot6 += decode[20] * c;
 6790-
 6791-#define stbir__store_output_tiny()                                             \
 6792-	output[0] = tot0;                                                          \
 6793-	output[1] = tot1;                                                          \
 6794-	output[2] = tot2;                                                          \
 6795-	output[3] = tot3;                                                          \
 6796-	output[4] = tot4;                                                          \
 6797-	output[5] = tot5;                                                          \
 6798-	output[6] = tot6;                                                          \
 6799-	horizontal_coefficients += coefficient_width;                              \
 6800-	++horizontal_contributors;                                                 \
 6801-	output += 7;
 6802-
 6803-#define stbir__4_coeff_start()                                                 \
 6804-	float x0, x1, x2, x3, x4, x5, x6, y0, y1, y2, y3, y4, y5, y6, c;           \
 6805-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6806-	c = hc[0];                                                                 \
 6807-	x0 = decode[0] * c;                                                        \
 6808-	x1 = decode[1] * c;                                                        \
 6809-	x2 = decode[2] * c;                                                        \
 6810-	x3 = decode[3] * c;                                                        \
 6811-	x4 = decode[4] * c;                                                        \
 6812-	x5 = decode[5] * c;                                                        \
 6813-	x6 = decode[6] * c;                                                        \
 6814-	c = hc[1];                                                                 \
 6815-	y0 = decode[7] * c;                                                        \
 6816-	y1 = decode[8] * c;                                                        \
 6817-	y2 = decode[9] * c;                                                        \
 6818-	y3 = decode[10] * c;                                                       \
 6819-	y4 = decode[11] * c;                                                       \
 6820-	y5 = decode[12] * c;                                                       \
 6821-	y6 = decode[13] * c;                                                       \
 6822-	c = hc[2];                                                                 \
 6823-	x0 += decode[14] * c;                                                      \
 6824-	x1 += decode[15] * c;                                                      \
 6825-	x2 += decode[16] * c;                                                      \
 6826-	x3 += decode[17] * c;                                                      \
 6827-	x4 += decode[18] * c;                                                      \
 6828-	x5 += decode[19] * c;                                                      \
 6829-	x6 += decode[20] * c;                                                      \
 6830-	c = hc[3];                                                                 \
 6831-	y0 += decode[21] * c;                                                      \
 6832-	y1 += decode[22] * c;                                                      \
 6833-	y2 += decode[23] * c;                                                      \
 6834-	y3 += decode[24] * c;                                                      \
 6835-	y4 += decode[25] * c;                                                      \
 6836-	y5 += decode[26] * c;                                                      \
 6837-	y6 += decode[27] * c;
 6838-
 6839-#define stbir__4_coeff_continue_from_4(ofs)                                    \
 6840-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6841-	c = hc[0 + (ofs)];                                                         \
 6842-	x0 += decode[0 + (ofs) * 7] * c;                                           \
 6843-	x1 += decode[1 + (ofs) * 7] * c;                                           \
 6844-	x2 += decode[2 + (ofs) * 7] * c;                                           \
 6845-	x3 += decode[3 + (ofs) * 7] * c;                                           \
 6846-	x4 += decode[4 + (ofs) * 7] * c;                                           \
 6847-	x5 += decode[5 + (ofs) * 7] * c;                                           \
 6848-	x6 += decode[6 + (ofs) * 7] * c;                                           \
 6849-	c = hc[1 + (ofs)];                                                         \
 6850-	y0 += decode[7 + (ofs) * 7] * c;                                           \
 6851-	y1 += decode[8 + (ofs) * 7] * c;                                           \
 6852-	y2 += decode[9 + (ofs) * 7] * c;                                           \
 6853-	y3 += decode[10 + (ofs) * 7] * c;                                          \
 6854-	y4 += decode[11 + (ofs) * 7] * c;                                          \
 6855-	y5 += decode[12 + (ofs) * 7] * c;                                          \
 6856-	y6 += decode[13 + (ofs) * 7] * c;                                          \
 6857-	c = hc[2 + (ofs)];                                                         \
 6858-	x0 += decode[14 + (ofs) * 7] * c;                                          \
 6859-	x1 += decode[15 + (ofs) * 7] * c;                                          \
 6860-	x2 += decode[16 + (ofs) * 7] * c;                                          \
 6861-	x3 += decode[17 + (ofs) * 7] * c;                                          \
 6862-	x4 += decode[18 + (ofs) * 7] * c;                                          \
 6863-	x5 += decode[19 + (ofs) * 7] * c;                                          \
 6864-	x6 += decode[20 + (ofs) * 7] * c;                                          \
 6865-	c = hc[3 + (ofs)];                                                         \
 6866-	y0 += decode[21 + (ofs) * 7] * c;                                          \
 6867-	y1 += decode[22 + (ofs) * 7] * c;                                          \
 6868-	y2 += decode[23 + (ofs) * 7] * c;                                          \
 6869-	y3 += decode[24 + (ofs) * 7] * c;                                          \
 6870-	y4 += decode[25 + (ofs) * 7] * c;                                          \
 6871-	y5 += decode[26 + (ofs) * 7] * c;                                          \
 6872-	y6 += decode[27 + (ofs) * 7] * c;
 6873-
 6874-#define stbir__1_coeff_remnant(ofs)                                            \
 6875-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6876-	c = hc[0 + (ofs)];                                                         \
 6877-	x0 += decode[0 + (ofs) * 7] * c;                                           \
 6878-	x1 += decode[1 + (ofs) * 7] * c;                                           \
 6879-	x2 += decode[2 + (ofs) * 7] * c;                                           \
 6880-	x3 += decode[3 + (ofs) * 7] * c;                                           \
 6881-	x4 += decode[4 + (ofs) * 7] * c;                                           \
 6882-	x5 += decode[5 + (ofs) * 7] * c;                                           \
 6883-	x6 += decode[6 + (ofs) * 7] * c;
 6884-
 6885-#define stbir__2_coeff_remnant(ofs)                                            \
 6886-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6887-	c = hc[0 + (ofs)];                                                         \
 6888-	x0 += decode[0 + (ofs) * 7] * c;                                           \
 6889-	x1 += decode[1 + (ofs) * 7] * c;                                           \
 6890-	x2 += decode[2 + (ofs) * 7] * c;                                           \
 6891-	x3 += decode[3 + (ofs) * 7] * c;                                           \
 6892-	x4 += decode[4 + (ofs) * 7] * c;                                           \
 6893-	x5 += decode[5 + (ofs) * 7] * c;                                           \
 6894-	x6 += decode[6 + (ofs) * 7] * c;                                           \
 6895-	c = hc[1 + (ofs)];                                                         \
 6896-	y0 += decode[7 + (ofs) * 7] * c;                                           \
 6897-	y1 += decode[8 + (ofs) * 7] * c;                                           \
 6898-	y2 += decode[9 + (ofs) * 7] * c;                                           \
 6899-	y3 += decode[10 + (ofs) * 7] * c;                                          \
 6900-	y4 += decode[11 + (ofs) * 7] * c;                                          \
 6901-	y5 += decode[12 + (ofs) * 7] * c;                                          \
 6902-	y6 += decode[13 + (ofs) * 7] * c;
 6903-
 6904-#define stbir__3_coeff_remnant(ofs)                                            \
 6905-	STBIR_SIMD_NO_UNROLL(decode);                                              \
 6906-	c = hc[0 + (ofs)];                                                         \
 6907-	x0 += decode[0 + (ofs) * 7] * c;                                           \
 6908-	x1 += decode[1 + (ofs) * 7] * c;                                           \
 6909-	x2 += decode[2 + (ofs) * 7] * c;                                           \
 6910-	x3 += decode[3 + (ofs) * 7] * c;                                           \
 6911-	x4 += decode[4 + (ofs) * 7] * c;                                           \
 6912-	x5 += decode[5 + (ofs) * 7] * c;                                           \
 6913-	x6 += decode[6 + (ofs) * 7] * c;                                           \
 6914-	c = hc[1 + (ofs)];                                                         \
 6915-	y0 += decode[7 + (ofs) * 7] * c;                                           \
 6916-	y1 += decode[8 + (ofs) * 7] * c;                                           \
 6917-	y2 += decode[9 + (ofs) * 7] * c;                                           \
 6918-	y3 += decode[10 + (ofs) * 7] * c;                                          \
 6919-	y4 += decode[11 + (ofs) * 7] * c;                                          \
 6920-	y5 += decode[12 + (ofs) * 7] * c;                                          \
 6921-	y6 += decode[13 + (ofs) * 7] * c;                                          \
 6922-	c = hc[2 + (ofs)];                                                         \
 6923-	x0 += decode[14 + (ofs) * 7] * c;                                          \
 6924-	x1 += decode[15 + (ofs) * 7] * c;                                          \
 6925-	x2 += decode[16 + (ofs) * 7] * c;                                          \
 6926-	x3 += decode[17 + (ofs) * 7] * c;                                          \
 6927-	x4 += decode[18 + (ofs) * 7] * c;                                          \
 6928-	x5 += decode[19 + (ofs) * 7] * c;                                          \
 6929-	x6 += decode[20 + (ofs) * 7] * c;
 6930-
 6931-#define stbir__store_output()                                                  \
 6932-	output[0] = x0 + y0;                                                       \
 6933-	output[1] = x1 + y1;                                                       \
 6934-	output[2] = x2 + y2;                                                       \
 6935-	output[3] = x3 + y3;                                                       \
 6936-	output[4] = x4 + y4;                                                       \
 6937-	output[5] = x5 + y5;                                                       \
 6938-	output[6] = x6 + y6;                                                       \
 6939-	horizontal_coefficients += coefficient_width;                              \
 6940-	++horizontal_contributors;                                                 \
 6941-	output += 7;
 6942-
 6943-#endif
 6944-
 6945-#define STBIR__horizontal_channels 7
 6946-#define STB_IMAGE_RESIZE_DO_HORIZONTALS
 6947-#include STBIR__HEADER_FILENAME
 6948-
 6949-// include all of the vertical resamplers (both scatter and gather versions)
 6950-
 6951-#define STBIR__vertical_channels 1
 6952-#define STB_IMAGE_RESIZE_DO_VERTICALS
 6953-#include STBIR__HEADER_FILENAME
 6954-
 6955-#define STBIR__vertical_channels 1
 6956-#define STB_IMAGE_RESIZE_DO_VERTICALS
 6957-#define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
 6958-#include STBIR__HEADER_FILENAME
 6959-
 6960-#define STBIR__vertical_channels 2
 6961-#define STB_IMAGE_RESIZE_DO_VERTICALS
 6962-#include STBIR__HEADER_FILENAME
 6963-
 6964-#define STBIR__vertical_channels 2
 6965-#define STB_IMAGE_RESIZE_DO_VERTICALS
 6966-#define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
 6967-#include STBIR__HEADER_FILENAME
 6968-
 6969-#define STBIR__vertical_channels 3
 6970-#define STB_IMAGE_RESIZE_DO_VERTICALS
 6971-#include STBIR__HEADER_FILENAME
 6972-
 6973-#define STBIR__vertical_channels 3
 6974-#define STB_IMAGE_RESIZE_DO_VERTICALS
 6975-#define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
 6976-#include STBIR__HEADER_FILENAME
 6977-
 6978-#define STBIR__vertical_channels 4
 6979-#define STB_IMAGE_RESIZE_DO_VERTICALS
 6980-#include STBIR__HEADER_FILENAME
 6981-
 6982-#define STBIR__vertical_channels 4
 6983-#define STB_IMAGE_RESIZE_DO_VERTICALS
 6984-#define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
 6985-#include STBIR__HEADER_FILENAME
 6986-
 6987-#define STBIR__vertical_channels 5
 6988-#define STB_IMAGE_RESIZE_DO_VERTICALS
 6989-#include STBIR__HEADER_FILENAME
 6990-
 6991-#define STBIR__vertical_channels 5
 6992-#define STB_IMAGE_RESIZE_DO_VERTICALS
 6993-#define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
 6994-#include STBIR__HEADER_FILENAME
 6995-
 6996-#define STBIR__vertical_channels 6
 6997-#define STB_IMAGE_RESIZE_DO_VERTICALS
 6998-#include STBIR__HEADER_FILENAME
 6999-
 7000-#define STBIR__vertical_channels 6
 7001-#define STB_IMAGE_RESIZE_DO_VERTICALS
 7002-#define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
 7003-#include STBIR__HEADER_FILENAME
 7004-
 7005-#define STBIR__vertical_channels 7
 7006-#define STB_IMAGE_RESIZE_DO_VERTICALS
 7007-#include STBIR__HEADER_FILENAME
 7008-
 7009-#define STBIR__vertical_channels 7
 7010-#define STB_IMAGE_RESIZE_DO_VERTICALS
 7011-#define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
 7012-#include STBIR__HEADER_FILENAME
 7013-
 7014-#define STBIR__vertical_channels 8
 7015-#define STB_IMAGE_RESIZE_DO_VERTICALS
 7016-#include STBIR__HEADER_FILENAME
 7017-
 7018-#define STBIR__vertical_channels 8
 7019-#define STB_IMAGE_RESIZE_DO_VERTICALS
 7020-#define STB_IMAGE_RESIZE_VERTICAL_CONTINUE
 7021-#include STBIR__HEADER_FILENAME
 7022-
 7023-typedef void
 7024-STBIR_VERTICAL_GATHERFUNC(float *output, float const *coeffs,
 7025-                          float const **inputs, float const *input0_end);
 7026-
 7027-static STBIR_VERTICAL_GATHERFUNC *stbir__vertical_gathers[8] = {
 7028-    stbir__vertical_gather_with_1_coeffs, stbir__vertical_gather_with_2_coeffs,
 7029-    stbir__vertical_gather_with_3_coeffs, stbir__vertical_gather_with_4_coeffs,
 7030-    stbir__vertical_gather_with_5_coeffs, stbir__vertical_gather_with_6_coeffs,
 7031-    stbir__vertical_gather_with_7_coeffs, stbir__vertical_gather_with_8_coeffs};
 7032-
 7033-static STBIR_VERTICAL_GATHERFUNC *stbir__vertical_gathers_continues[8] = {
 7034-    stbir__vertical_gather_with_1_coeffs_cont,
 7035-    stbir__vertical_gather_with_2_coeffs_cont,
 7036-    stbir__vertical_gather_with_3_coeffs_cont,
 7037-    stbir__vertical_gather_with_4_coeffs_cont,
 7038-    stbir__vertical_gather_with_5_coeffs_cont,
 7039-    stbir__vertical_gather_with_6_coeffs_cont,
 7040-    stbir__vertical_gather_with_7_coeffs_cont,
 7041-    stbir__vertical_gather_with_8_coeffs_cont};
 7042-
 7043-typedef void
 7044-STBIR_VERTICAL_SCATTERFUNC(float **outputs, float const *coeffs,
 7045-                           float const *input, float const *input_end);
 7046-
 7047-static STBIR_VERTICAL_SCATTERFUNC *stbir__vertical_scatter_sets[8] = {
 7048-    stbir__vertical_scatter_with_1_coeffs,
 7049-    stbir__vertical_scatter_with_2_coeffs,
 7050-    stbir__vertical_scatter_with_3_coeffs,
 7051-    stbir__vertical_scatter_with_4_coeffs,
 7052-    stbir__vertical_scatter_with_5_coeffs,
 7053-    stbir__vertical_scatter_with_6_coeffs,
 7054-    stbir__vertical_scatter_with_7_coeffs,
 7055-    stbir__vertical_scatter_with_8_coeffs};
 7056-
 7057-static STBIR_VERTICAL_SCATTERFUNC *stbir__vertical_scatter_blends[8] = {
 7058-    stbir__vertical_scatter_with_1_coeffs_cont,
 7059-    stbir__vertical_scatter_with_2_coeffs_cont,
 7060-    stbir__vertical_scatter_with_3_coeffs_cont,
 7061-    stbir__vertical_scatter_with_4_coeffs_cont,
 7062-    stbir__vertical_scatter_with_5_coeffs_cont,
 7063-    stbir__vertical_scatter_with_6_coeffs_cont,
 7064-    stbir__vertical_scatter_with_7_coeffs_cont,
 7065-    stbir__vertical_scatter_with_8_coeffs_cont};
 7066-
 7067-static void
 7068-stbir__encode_scanline(stbir__info const *stbir_info, void *output_buffer_data,
 7069-                       float *encode_buffer,
 7070-                       int row STBIR_ONLY_PROFILE_GET_SPLIT_INFO)
 7071-{
 7072-	int num_pixels = stbir_info->horizontal.scale_info.output_sub_size;
 7073-	int channels = stbir_info->channels;
 7074-	int width_times_channels = num_pixels * channels;
 7075-	void *output_buffer;
 7076-
 7077-	// un-alpha weight if we need to
 7078-	if (stbir_info->alpha_unweight) {
 7079-		STBIR_PROFILE_START(unalpha);
 7080-		stbir_info->alpha_unweight(encode_buffer, width_times_channels);
 7081-		STBIR_PROFILE_END(unalpha);
 7082-	}
 7083-
 7084-	// write directly into output by default
 7085-	output_buffer = output_buffer_data;
 7086-
 7087-	// if we have an output callback, we first convert the encode buffer in
 7088-	// place (and then hand that to the callback)
 7089-	if (stbir_info->out_pixels_cb) {
 7090-		output_buffer = encode_buffer;
 7091-	}
 7092-
 7093-	STBIR_PROFILE_START(encode);
 7094-	// convert into the output buffer
 7095-	stbir_info->encode_pixels(output_buffer, width_times_channels,
 7096-	                          encode_buffer);
 7097-	STBIR_PROFILE_END(encode);
 7098-
 7099-	// if we have an output callback, call it to send the data
 7100-	if (stbir_info->out_pixels_cb) {
 7101-		stbir_info->out_pixels_cb(output_buffer, num_pixels, row,
 7102-		                          stbir_info->user_data);
 7103-	}
 7104-}
 7105-
 7106-// Get the ring buffer pointer for an index
 7107-static float *
 7108-stbir__get_ring_buffer_entry(stbir__info const *stbir_info,
 7109-                             stbir__per_split_info const *split_info, int index)
 7110-{
 7111-	STBIR_ASSERT(index < stbir_info->ring_buffer_num_entries);
 7112-
 7113-#ifdef STBIR__SEPARATE_ALLOCATIONS
 7114-	return split_info->ring_buffers[index];
 7115-#else
 7116-	return (float *)(((char *)split_info->ring_buffer) +
 7117-	                 (index * stbir_info->ring_buffer_length_bytes));
 7118-#endif
 7119-}
 7120-
 7121-// Get the specified scan line from the ring buffer
 7122-static float *
 7123-stbir__get_ring_buffer_scanline(stbir__info const *stbir_info,
 7124-                                stbir__per_split_info const *split_info,
 7125-                                int get_scanline)
 7126-{
 7127-	int ring_buffer_index =
 7128-	    (split_info->ring_buffer_begin_index +
 7129-	     (get_scanline - split_info->ring_buffer_first_scanline)) %
 7130-	    stbir_info->ring_buffer_num_entries;
 7131-	return stbir__get_ring_buffer_entry(stbir_info, split_info,
 7132-	                                    ring_buffer_index);
 7133-}
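
/*
 * Illustrative sketch, not part of the removed file: how a scanline number
 * maps to a ring buffer slot, mirroring the modulo arithmetic in
 * stbir__get_ring_buffer_scanline above. The struct and function names are
 * hypothetical stand-ins for the relevant stbir__per_split_info and
 * stbir__info fields.
 */
#include <assert.h>

typedef struct {
	int begin_index;    /* slot holding the oldest scanline */
	int first_scanline; /* scanline number stored in that slot */
	int num_entries;    /* total number of slots in the ring */
} example_ring;

static int
example_ring_slot(example_ring const *r, int scanline)
{
	/* scanlines older than the first one have already been evicted */
	assert(scanline >= r->first_scanline);
	return (r->begin_index + (scanline - r->first_scanline)) %
	       r->num_entries;
}

/*
 * With begin_index 2, first_scanline 10 and num_entries 4, scanline 10 maps
 * to slot 2, scanline 11 to slot 3, and scanline 12 wraps around to slot 0.
 */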
 7134-
 7135-static void
 7136-stbir__resample_horizontal_gather(
 7137-    stbir__info const *stbir_info, float *output_buffer,
 7138-    float const *input_buffer STBIR_ONLY_PROFILE_GET_SPLIT_INFO)
 7139-{
 7140-	float const *decode_buffer =
 7141-	    input_buffer - (stbir_info->scanline_extents.conservative.n0 *
 7142-	                    stbir_info->effective_channels);
 7143-
 7144-	STBIR_PROFILE_START(horizontal);
 7145-	if ((stbir_info->horizontal.filter_enum == STBIR_FILTER_POINT_SAMPLE) &&
 7146-	    (stbir_info->horizontal.scale_info.scale == 1.0f)) {
 7147-		STBIR_MEMCPY(output_buffer, input_buffer,
 7148-		             stbir_info->horizontal.scale_info.output_sub_size *
 7149-		                 sizeof(float) * stbir_info->effective_channels);
 7150-	} else {
 7151-		stbir_info->horizontal_gather_channels(
 7152-		    output_buffer, stbir_info->horizontal.scale_info.output_sub_size,
 7153-		    decode_buffer, stbir_info->horizontal.contributors,
 7154-		    stbir_info->horizontal.coefficients,
 7155-		    stbir_info->horizontal.coefficient_width);
 7156-	}
 7157-	STBIR_PROFILE_END(horizontal);
 7158-}
 7159-
 7160-static void
 7161-stbir__resample_vertical_gather(stbir__info const *stbir_info,
 7162-                                stbir__per_split_info *split_info, int n,
 7163-                                int contrib_n0, int contrib_n1,
 7164-                                float const *vertical_coefficients)
 7165-{
 7166-	float *encode_buffer = split_info->vertical_buffer;
 7167-	float *decode_buffer = split_info->decode_buffer;
 7168-	int vertical_first = stbir_info->vertical_first;
 7169-	int width = (vertical_first)
 7170-	                ? (stbir_info->scanline_extents.conservative.n1 -
 7171-	                   stbir_info->scanline_extents.conservative.n0 + 1)
 7172-	                : stbir_info->horizontal.scale_info.output_sub_size;
 7173-	int width_times_channels = stbir_info->effective_channels * width;
 7174-
 7175-	STBIR_ASSERT(stbir_info->vertical.is_gather);
 7176-
 7177-	// loop over the contributing scanlines and scale into the buffer
 7178-	STBIR_PROFILE_START(vertical);
 7179-	{
 7180-		int k = 0, total = contrib_n1 - contrib_n0 + 1;
 7181-		STBIR_ASSERT(total > 0);
 7182-		do {
 7183-			float const *inputs[8];
 7184-			int i, cnt = total;
 7185-			if (cnt > 8) {
 7186-				cnt = 8;
 7187-			}
 7188-			for (i = 0; i < cnt; i++) {
 7189-				inputs[i] = stbir__get_ring_buffer_scanline(
 7190-				    stbir_info, split_info, k + i + contrib_n0);
 7191-			}
 7192-
 7193-			// call the N scanlines at a time function (up to 8 scanlines of
 7194-			// blending at once)
 7195-			((k == 0) ? stbir__vertical_gathers
 7196-			          : stbir__vertical_gathers_continues)[cnt - 1](
 7197-			    (vertical_first) ? decode_buffer : encode_buffer,
 7198-			    vertical_coefficients + k, inputs,
 7199-			    inputs[0] + width_times_channels);
 7200-			k += cnt;
 7201-			total -= cnt;
 7202-		} while (total);
 7203-	}
 7204-	STBIR_PROFILE_END(vertical);
 7205-
 7206-	if (vertical_first) {
 7207-		// Now resample the gathered vertical data in the horizontal axis into
 7208-		// the encode buffer
 7209-		decode_buffer[width_times_channels] =
 7210-		    0.0f; // clear two over for horizontals with a remnant of 3
 7211-		decode_buffer[width_times_channels + 1] = 0.0f;
 7212-		stbir__resample_horizontal_gather(
 7213-		    stbir_info, encode_buffer,
 7214-		    decode_buffer STBIR_ONLY_PROFILE_SET_SPLIT_INFO);
 7215-	}
 7216-
 7217-	stbir__encode_scanline(
 7218-	    stbir_info,
 7219-	    ((char *)stbir_info->output_data) +
 7220-	        ((size_t)n * (size_t)stbir_info->output_stride_bytes),
 7221-	    encode_buffer, n STBIR_ONLY_PROFILE_SET_SPLIT_INFO);
 7222-}
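
/*
 * Illustrative sketch, not part of the removed file: a scalar version of the
 * batching done in stbir__resample_vertical_gather above. It forms a weighted
 * sum of `total` input rows in batches of at most 8; the first batch
 * overwrites the output and later batches accumulate into it, which is the
 * role of the plain and "_cont" gather tables. All names are hypothetical.
 */
static void
example_gather(float *out, float const *const *rows, float const *coeffs,
               int total, int width)
{
	int k = 0;
	while (total > 0) {
		int cnt = (total > 8) ? 8 : total;
		int x, i;
		for (x = 0; x < width; x++) {
			/* first batch starts from zero, later ones continue the sum */
			float sum = (k == 0) ? 0.0f : out[x];
			for (i = 0; i < cnt; i++) {
				sum += rows[k + i][x] * coeffs[k + i];
			}
			out[x] = sum;
		}
		k += cnt;
		total -= cnt;
	}
}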
 7223-
 7224-static void
 7225-stbir__decode_and_resample_for_vertical_gather_loop(
 7226-    stbir__info const *stbir_info, stbir__per_split_info *split_info, int n)
 7227-{
 7228-	int ring_buffer_index;
 7229-	float *ring_buffer;
 7230-
 7231-	// Decode the nth scanline from the source image into the decode buffer.
 7232-	stbir__decode_scanline(
 7233-	    stbir_info, n,
 7234-	    split_info->decode_buffer STBIR_ONLY_PROFILE_SET_SPLIT_INFO);
 7235-
 7236-	// update new end scanline
 7237-	split_info->ring_buffer_last_scanline = n;
 7238-
 7239-	// get ring buffer
 7240-	ring_buffer_index = (split_info->ring_buffer_begin_index +
 7241-	                     (split_info->ring_buffer_last_scanline -
 7242-	                      split_info->ring_buffer_first_scanline)) %
 7243-	                    stbir_info->ring_buffer_num_entries;
 7244-	ring_buffer =
 7245-	    stbir__get_ring_buffer_entry(stbir_info, split_info, ring_buffer_index);
 7246-
 7247-	// Now resample it into the ring buffer.
 7248-	stbir__resample_horizontal_gather(
 7249-	    stbir_info, ring_buffer,
 7250-	    split_info->decode_buffer STBIR_ONLY_PROFILE_SET_SPLIT_INFO);
 7251-
 7252-	// Now it's sitting in the ring buffer ready to be used as source for the
 7253-	// vertical sampling.
 7254-}
 7255-
 7256-static void
 7257-stbir__vertical_gather_loop(stbir__info const *stbir_info,
 7258-                            stbir__per_split_info *split_info, int split_count)
 7259-{
 7260-	int y, start_output_y, end_output_y;
 7261-	stbir__contributors *vertical_contributors =
 7262-	    stbir_info->vertical.contributors;
 7263-	float const *vertical_coefficients = stbir_info->vertical.coefficients;
 7264-
 7265-	STBIR_ASSERT(stbir_info->vertical.is_gather);
 7266-
 7267-	start_output_y = split_info->start_output_y;
 7268-	end_output_y = split_info[split_count - 1].end_output_y;
 7269-
 7270-	vertical_contributors += start_output_y;
 7271-	vertical_coefficients +=
 7272-	    start_output_y * stbir_info->vertical.coefficient_width;
 7273-
 7274-	// initialize the ring buffer for gathering
 7275-	split_info->ring_buffer_begin_index = 0;
 7276-	split_info->ring_buffer_first_scanline = vertical_contributors->n0;
 7277-	split_info->ring_buffer_last_scanline =
 7278-	    split_info->ring_buffer_first_scanline - 1; // means "empty"
 7279-
 7280-	for (y = start_output_y; y < end_output_y; y++) {
 7281-		int in_first_scanline, in_last_scanline;
 7282-
 7283-		in_first_scanline = vertical_contributors->n0;
 7284-		in_last_scanline = vertical_contributors->n1;
 7285-
 7286-		// make sure the indexing hasn't broken
 7287-		STBIR_ASSERT(in_first_scanline >=
 7288-		             split_info->ring_buffer_first_scanline);
 7289-
 7290-		// Load in new scanlines
 7291-		while (in_last_scanline > split_info->ring_buffer_last_scanline) {
 7292-			STBIR_ASSERT((split_info->ring_buffer_last_scanline -
 7293-			              split_info->ring_buffer_first_scanline + 1) <=
 7294-			             stbir_info->ring_buffer_num_entries);
 7295-
 7296-			// make sure there was room in the ring buffer when we add new
 7297-			// scanlines
 7298-			if ((split_info->ring_buffer_last_scanline -
 7299-			     split_info->ring_buffer_first_scanline + 1) ==
 7300-			    stbir_info->ring_buffer_num_entries) {
 7301-				split_info->ring_buffer_first_scanline++;
 7302-				split_info->ring_buffer_begin_index++;
 7303-			}
 7304-
 7305-			if (stbir_info->vertical_first) {
 7306-				float *ring_buffer = stbir__get_ring_buffer_scanline(
 7307-				    stbir_info, split_info,
 7308-				    ++split_info->ring_buffer_last_scanline);
 7309-				// Decode the next scanline from the source image directly into
 7310-				// its ring buffer entry.
 7311-				stbir__decode_scanline(
 7312-				    stbir_info, split_info->ring_buffer_last_scanline,
 7313-				    ring_buffer STBIR_ONLY_PROFILE_SET_SPLIT_INFO);
 7314-			} else {
 7315-				stbir__decode_and_resample_for_vertical_gather_loop(
 7316-				    stbir_info, split_info,
 7317-				    split_info->ring_buffer_last_scanline + 1);
 7318-			}
 7319-		}
 7320-
 7321-		// Now all buffers should be ready to write a row of vertical sampling,
 7322-		// so do it.
 7323-		stbir__resample_vertical_gather(stbir_info, split_info, y,
 7324-		                                in_first_scanline, in_last_scanline,
 7325-		                                vertical_coefficients);
 7326-
 7327-		++vertical_contributors;
 7328-		vertical_coefficients += stbir_info->vertical.coefficient_width;
 7329-	}
 7330-}
 7331-
 7332-#define STBIR__FLOAT_EMPTY_MARKER 3.0e+38F
 7333-#define STBIR__FLOAT_BUFFER_IS_EMPTY(ptr)                                      \
 7334-	((ptr)[0] == STBIR__FLOAT_EMPTY_MARKER)
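
/*
 * Illustrative sketch, not part of the removed file: how the scatter path
 * can use the first float of a row as an "empty" sentinel. An empty row is
 * overwritten (the "set" functions); a row that already holds data is
 * accumulated into (the "blend" functions). Names here are hypothetical.
 */
#define EXAMPLE_EMPTY_MARKER 3.0e+38F

static void
example_scatter_row(float *row, float const *input, int count, float coeff)
{
	int i;
	/* read the sentinel before the loop overwrites row[0] */
	int empty = (row[0] == EXAMPLE_EMPTY_MARKER);
	for (i = 0; i < count; i++) {
		row[i] = empty ? input[i] * coeff : row[i] + input[i] * coeff;
	}
}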
 7335-
 7336-static void
 7337-stbir__encode_first_scanline_from_scatter(stbir__info const *stbir_info,
 7338-                                          stbir__per_split_info *split_info)
 7339-{
 7340-	// evict a scanline out into the output buffer
 7341-	float *ring_buffer_entry = stbir__get_ring_buffer_entry(
 7342-	    stbir_info, split_info, split_info->ring_buffer_begin_index);
 7343-
 7344-	// dump the scanline out
 7345-	stbir__encode_scanline(stbir_info,
 7346-	                       ((char *)stbir_info->output_data) +
 7347-	                           ((size_t)split_info->ring_buffer_first_scanline *
 7348-	                            (size_t)stbir_info->output_stride_bytes),
 7349-	                       ring_buffer_entry,
 7350-	                       split_info->ring_buffer_first_scanline
 7351-	                           STBIR_ONLY_PROFILE_SET_SPLIT_INFO);
 7352-
 7353-	// mark it as empty
 7354-	ring_buffer_entry[0] = STBIR__FLOAT_EMPTY_MARKER;
 7355-
 7356-	// advance the first scanline
 7357-	split_info->ring_buffer_first_scanline++;
 7358-	if (++split_info->ring_buffer_begin_index ==
 7359-	    stbir_info->ring_buffer_num_entries) {
 7360-		split_info->ring_buffer_begin_index = 0;
 7361-	}
 7362-}
 7363-
 7364-static void
 7365-stbir__horizontal_resample_and_encode_first_scanline_from_scatter(
 7366-    stbir__info const *stbir_info, stbir__per_split_info *split_info)
 7367-{
 7368-	// evict a scanline out into the output buffer
 7369-
 7370-	float *ring_buffer_entry = stbir__get_ring_buffer_entry(
 7371-	    stbir_info, split_info, split_info->ring_buffer_begin_index);
 7372-
 7373-	// Now resample it into the buffer.
 7374-	stbir__resample_horizontal_gather(
 7375-	    stbir_info, split_info->vertical_buffer,
 7376-	    ring_buffer_entry STBIR_ONLY_PROFILE_SET_SPLIT_INFO);
 7377-
 7378-	// dump the scanline out
 7379-	stbir__encode_scanline(stbir_info,
 7380-	                       ((char *)stbir_info->output_data) +
 7381-	                           ((size_t)split_info->ring_buffer_first_scanline *
 7382-	                            (size_t)stbir_info->output_stride_bytes),
 7383-	                       split_info->vertical_buffer,
 7384-	                       split_info->ring_buffer_first_scanline
 7385-	                           STBIR_ONLY_PROFILE_SET_SPLIT_INFO);
 7386-
 7387-	// mark it as empty
 7388-	ring_buffer_entry[0] = STBIR__FLOAT_EMPTY_MARKER;
 7389-
 7390-	// advance the first scanline
 7391-	split_info->ring_buffer_first_scanline++;
 7392-	if (++split_info->ring_buffer_begin_index ==
 7393-	    stbir_info->ring_buffer_num_entries) {
 7394-		split_info->ring_buffer_begin_index = 0;
 7395-	}
 7396-}
 7397-
 7398-static void
 7399-stbir__resample_vertical_scatter(stbir__info const *stbir_info,
 7400-                                 stbir__per_split_info *split_info, int n0,
 7401-                                 int n1, float const *vertical_coefficients,
 7402-                                 float const *vertical_buffer,
 7403-                                 float const *vertical_buffer_end)
 7404-{
 7405-	STBIR_ASSERT(!stbir_info->vertical.is_gather);
 7406-
 7407-	STBIR_PROFILE_START(vertical);
 7408-	{
 7409-		int k = 0, total = n1 - n0 + 1;
 7410-		STBIR_ASSERT(total > 0);
 7411-		do {
 7412-			float *outputs[8];
 7413-			int i, n = total;
 7414-			if (n > 8) {
 7415-				n = 8;
 7416-			}
 7417-			for (i = 0; i < n; i++) {
 7418-				outputs[i] = stbir__get_ring_buffer_scanline(
 7419-				    stbir_info, split_info, k + i + n0);
 7420-				if ((i) &&
 7421-				    (STBIR__FLOAT_BUFFER_IS_EMPTY(outputs[i]) !=
 7422-				     STBIR__FLOAT_BUFFER_IS_EMPTY(
 7423-				         outputs[0]))) // make sure runs are of the same type
 7424-				{
 7425-					n = i;
 7426-					break;
 7427-				}
 7428-			}
 7429-			// call the scatter to N scanlines at a time function (up to 8
 7430-			// scanlines of scattering at once)
 7431-			((STBIR__FLOAT_BUFFER_IS_EMPTY(outputs[0]))
 7432-			     ? stbir__vertical_scatter_sets
 7433-			     : stbir__vertical_scatter_blends)[n - 1](
 7434-			    outputs, vertical_coefficients + k, vertical_buffer,
 7435-			    vertical_buffer_end);
 7436-			k += n;
 7437-			total -= n;
 7438-		} while (total);
 7439-	}
 7440-
 7441-	STBIR_PROFILE_END(vertical);
 7442-}
 7443-
 7444-typedef void
 7445-stbir__handle_scanline_for_scatter_func(stbir__info const *stbir_info,
 7446-                                        stbir__per_split_info *split_info);
 7447-
 7448-static void
 7449-stbir__vertical_scatter_loop(stbir__info const *stbir_info,
 7450-                             stbir__per_split_info *split_info, int split_count)
 7451-{
 7452-	int y, start_output_y, end_output_y, start_input_y, end_input_y;
 7453-	stbir__contributors *vertical_contributors =
 7454-	    stbir_info->vertical.contributors;
 7455-	float const *vertical_coefficients = stbir_info->vertical.coefficients;
 7456-	stbir__handle_scanline_for_scatter_func *handle_scanline_for_scatter;
 7457-	void *scanline_scatter_buffer;
 7458-	void *scanline_scatter_buffer_end;
 7459-	int on_first_input_y, last_input_y;
 7460-	int width = (stbir_info->vertical_first)
 7461-	                ? (stbir_info->scanline_extents.conservative.n1 -
 7462-	                   stbir_info->scanline_extents.conservative.n0 + 1)
 7463-	                : stbir_info->horizontal.scale_info.output_sub_size;
 7464-	int width_times_channels = stbir_info->effective_channels * width;
 7465-
 7466-	STBIR_ASSERT(!stbir_info->vertical.is_gather);
 7467-
 7468-	start_output_y = split_info->start_output_y;
 7469-	end_output_y = split_info[split_count - 1]
 7470-	                   .end_output_y; // may do multiple split counts
 7471-
 7472-	start_input_y = split_info->start_input_y;
 7473-	end_input_y = split_info[split_count - 1].end_input_y;
 7474-
 7475-	// adjust for starting offset start_input_y
 7476-	y = start_input_y + stbir_info->vertical.filter_pixel_margin;
 7477-	vertical_contributors += y;
 7478-	vertical_coefficients += stbir_info->vertical.coefficient_width * y;
 7479-
 7480-	if (stbir_info->vertical_first) {
 7481-		handle_scanline_for_scatter =
 7482-		    stbir__horizontal_resample_and_encode_first_scanline_from_scatter;
 7483-		scanline_scatter_buffer = split_info->decode_buffer;
 7484-		scanline_scatter_buffer_end =
 7485-		    ((char *)scanline_scatter_buffer) +
 7486-		    sizeof(float) * stbir_info->effective_channels *
 7487-		        (stbir_info->scanline_extents.conservative.n1 -
 7488-		         stbir_info->scanline_extents.conservative.n0 + 1);
 7489-	} else {
 7490-		handle_scanline_for_scatter = stbir__encode_first_scanline_from_scatter;
 7491-		scanline_scatter_buffer = split_info->vertical_buffer;
 7492-		scanline_scatter_buffer_end =
 7493-		    ((char *)scanline_scatter_buffer) +
 7494-		    sizeof(float) * stbir_info->effective_channels *
 7495-		        stbir_info->horizontal.scale_info.output_sub_size;
 7496-	}
 7497-
 7498-	// initialize the ring buffer for scattering
 7499-	split_info->ring_buffer_first_scanline = start_output_y;
 7500-	split_info->ring_buffer_last_scanline = -1;
 7501-	split_info->ring_buffer_begin_index = -1;
 7502-
 7503-	// mark all the buffers as empty to start
 7504-	for (y = 0; y < stbir_info->ring_buffer_num_entries; y++) {
 7505-		float *decode_buffer =
 7506-		    stbir__get_ring_buffer_entry(stbir_info, split_info, y);
 7507-		decode_buffer[width_times_channels] =
 7508-		    0.0f; // clear two over for horizontals with a remnant of 3
 7509-		decode_buffer[width_times_channels + 1] = 0.0f;
 7510-		decode_buffer[0] = STBIR__FLOAT_EMPTY_MARKER; // only used on scatter
 7511-	}
 7512-
 7513-	// do the loop in input space
 7514-	on_first_input_y = 1;
 7515-	last_input_y = start_input_y;
 7516-	for (y = start_input_y; y < end_input_y; y++) {
 7517-		int out_first_scanline, out_last_scanline;
 7518-
 7519-		out_first_scanline = vertical_contributors->n0;
 7520-		out_last_scanline = vertical_contributors->n1;
 7521-
 7522-		STBIR_ASSERT(out_last_scanline - out_first_scanline + 1 <=
 7523-		             stbir_info->ring_buffer_num_entries);
 7524-
 7525-		if ((out_last_scanline >= out_first_scanline) &&
 7526-		    (((out_first_scanline >= start_output_y) &&
 7527-		      (out_first_scanline < end_output_y)) ||
 7528-		     ((out_last_scanline >= start_output_y) &&
 7529-		      (out_last_scanline < end_output_y)))) {
 7530-			float const *vc = vertical_coefficients;
 7531-
 7532-			// keep track of the range actually seen for the next resize
 7533-			last_input_y = y;
 7534-			if ((on_first_input_y) && (y > start_input_y)) {
 7535-				split_info->start_input_y = y;
 7536-			}
 7537-			on_first_input_y = 0;
 7538-
 7539-			// clip the region
 7540-			if (out_first_scanline < start_output_y) {
 7541-				vc += start_output_y - out_first_scanline;
 7542-				out_first_scanline = start_output_y;
 7543-			}
 7544-
 7545-			if (out_last_scanline >= end_output_y) {
 7546-				out_last_scanline = end_output_y - 1;
 7547-			}
 7548-
 7549-			// if very first scanline, init the index
 7550-			if (split_info->ring_buffer_begin_index < 0) {
 7551-				split_info->ring_buffer_begin_index =
 7552-				    out_first_scanline - start_output_y;
 7553-			}
 7554-
 7555-			STBIR_ASSERT(split_info->ring_buffer_begin_index <=
 7556-			             out_first_scanline);
 7557-
 7558-			// Decode the nth scanline from the source image into the decode
 7559-			// buffer.
 7560-			stbir__decode_scanline(
 7561-			    stbir_info, y,
 7562-			    split_info->decode_buffer STBIR_ONLY_PROFILE_SET_SPLIT_INFO);
 7563-
 7564-			// When horizontal first, we resample horizontally into the vertical
 7565-			// buffer before we scatter it out
 7566-			if (!stbir_info->vertical_first) {
 7567-				stbir__resample_horizontal_gather(
 7568-				    stbir_info, split_info->vertical_buffer,
 7569-				    split_info
 7570-				        ->decode_buffer STBIR_ONLY_PROFILE_SET_SPLIT_INFO);
 7571-			}
 7572-
 7573-			// Now it's sitting in the buffer ready to be distributed into the
 7574-			// ring buffers.
 7575-
 7576-			// evict from the ring buffer if it is full
 7577-			if (((split_info->ring_buffer_last_scanline -
 7578-			      split_info->ring_buffer_first_scanline + 1) ==
 7579-			     stbir_info->ring_buffer_num_entries) &&
 7580-			    (out_last_scanline > split_info->ring_buffer_last_scanline)) {
 7581-				handle_scanline_for_scatter(stbir_info, split_info);
 7582-			}
 7583-
 7584-			// Now the horizontal buffer is ready to write to all ring buffer
 7585-			// rows, so do it.
 7586-			stbir__resample_vertical_scatter(
 7587-			    stbir_info, split_info, out_first_scanline, out_last_scanline,
 7588-			    vc, (float *)scanline_scatter_buffer,
 7589-			    (float *)scanline_scatter_buffer_end);
 7590-
 7591-			// update the end of the buffer
 7592-			if (out_last_scanline > split_info->ring_buffer_last_scanline) {
 7593-				split_info->ring_buffer_last_scanline = out_last_scanline;
 7594-			}
 7595-		}
 7596-		++vertical_contributors;
 7597-		vertical_coefficients += stbir_info->vertical.coefficient_width;
 7598-	}
 7599-
 7600-	// now evict the scanlines that are left over in the ring buffer
 7601-	while (split_info->ring_buffer_first_scanline < end_output_y) {
 7602-		handle_scanline_for_scatter(stbir_info, split_info);
 7603-	}
 7604-
 7605-	// update the end_input_y if we do multiple resizes with the same data
 7606-	++last_input_y;
 7607-	for (y = 0; y < split_count; y++) {
 7608-		if (split_info[y].end_input_y > last_input_y) {
 7609-			split_info[y].end_input_y = last_input_y;
 7610-		}
 7611-	}
 7612-}
 7613-
 7614-static stbir__kernel_callback *stbir__builtin_kernels[] = {
 7615-    0,
 7616-    stbir__filter_trapezoid,
 7617-    stbir__filter_triangle,
 7618-    stbir__filter_cubic,
 7619-    stbir__filter_catmullrom,
 7620-    stbir__filter_mitchell,
 7621-    stbir__filter_point};
 7622-static stbir__support_callback *stbir__builtin_supports[] = {
 7623-    0,
 7624-    stbir__support_trapezoid,
 7625-    stbir__support_one,
 7626-    stbir__support_two,
 7627-    stbir__support_two,
 7628-    stbir__support_two,
 7629-    stbir__support_zeropoint5};
 7630-
 7631-static void
 7632-stbir__set_sampler(stbir__sampler *samp, stbir_filter filter,
 7633-                   stbir__kernel_callback *kernel,
 7634-                   stbir__support_callback *support, stbir_edge edge,
 7635-                   stbir__scale_info *scale_info, int always_gather,
 7636-                   void *user_data)
 7637-{
 7638-	// set filter
 7639-	if (filter == 0) {
 7640-		filter = STBIR_DEFAULT_FILTER_DOWNSAMPLE; // default to downsample
 7641-		if (scale_info->scale >= (1.0f - stbir__small_float)) {
 7642-			if ((scale_info->scale <= (1.0f + stbir__small_float)) &&
 7643-			    (STBIR_CEILF(scale_info->pixel_shift) ==
 7644-			     scale_info->pixel_shift)) {
 7645-				filter = STBIR_FILTER_POINT_SAMPLE;
 7646-			} else {
 7647-				filter = STBIR_DEFAULT_FILTER_UPSAMPLE;
 7648-			}
 7649-		}
 7650-	}
 7651-	samp->filter_enum = filter;
 7652-
 7653-	STBIR_ASSERT(samp->filter_enum != 0);
 7654-	STBIR_ASSERT((unsigned)samp->filter_enum < STBIR_FILTER_OTHER);
 7655-	samp->filter_kernel = stbir__builtin_kernels[filter];
 7656-	samp->filter_support = stbir__builtin_supports[filter];
 7657-
 7658-	if (kernel && support) {
 7659-		samp->filter_kernel = kernel;
 7660-		samp->filter_support = support;
 7661-		samp->filter_enum = STBIR_FILTER_OTHER;
 7662-	}
 7663-
 7664-	samp->edge = edge;
 7665-	samp->filter_pixel_width = stbir__get_filter_pixel_width(
 7666-	    samp->filter_support, scale_info->scale, user_data);
 7667-	// Gather is always better, but in extreme downsamples, you have to have
 7668-	// most or all of the data in memory
 7669-	//    For horizontal, we always have all the pixels, so we always use gather
 7670-	//    here (always_gather==1). For vertical, we use gather if scaling up
 7671-	//    (which means we will have samp->filter_pixel_width scanlines in memory
 7672-	//    at once).
 7673-	samp->is_gather = 0;
 7674-	if (scale_info->scale >= (1.0f - stbir__small_float)) {
 7675-		samp->is_gather = 1;
 7676-	} else if ((always_gather) ||
 7677-	           (samp->filter_pixel_width <=
 7678-	            STBIR_FORCE_GATHER_FILTER_SCANLINES_AMOUNT)) {
 7679-		samp->is_gather = 2;
 7680-	}
 7681-
 7682-	// pre calculate stuff based on the above
 7683-	samp->coefficient_width =
 7684-	    stbir__get_coefficient_width(samp, samp->is_gather, user_data);
 7685-
 7686-	// filter_pixel_width is the conservative size in pixels of input that
 7687-	// affect an output pixel.
 7688-	//   In rare cases (only with 2 pix to 1 pix with the default filters), it's
 7689-	//   possible that the filter will extend before or after the scanline
 7690-	//   beyond just one extra entire copy of the scanline (we would hit the
 7691-	//   edge twice). We don't let you do that, so we clamp the total width to
 7692-	//   3x the total of input pixel (once for the scanline, once for the left
 7693-	//   side overhang, and once for the right side). We only do this for wrap
 7694-	//   edge mode, since the other modes can just re-edge clamp back in again.
 7695-	if (edge == STBIR_EDGE_WRAP) {
 7696-		if (samp->filter_pixel_width > (scale_info->input_full_size * 3)) {
 7697-			samp->filter_pixel_width = scale_info->input_full_size * 3;
 7698-		}
 7699-	}
 7700-
 7701-	// This is how much to expand buffers to account for filters seeking outside
 7702-	// the image boundaries.
 7703-	samp->filter_pixel_margin = samp->filter_pixel_width / 2;
 7704-
 7705-	// filter_pixel_margin is the amount that this filter can overhang on just
 7706-	// one side of either
 7707-	//   end of the scanline (left or the right). Since we only allow you to
 7708-	//   overhang 1 scanline's worth of pixels, we clamp this one side of
 7709-	//   overhang to the input scanline size. Again, this clamping only happens
 7710-	//   in rare cases with the default filters (2 pix to 1 pix).
 7711-	if (edge == STBIR_EDGE_WRAP) {
 7712-		if (samp->filter_pixel_margin > scale_info->input_full_size) {
 7713-			samp->filter_pixel_margin = scale_info->input_full_size;
 7714-		}
 7715-	}
 7716-
 7717-	samp->num_contributors = stbir__get_contributors(samp, samp->is_gather);
 7718-
 7719-	samp->contributors_size =
 7720-	    samp->num_contributors * sizeof(stbir__contributors);
 7721-	samp->coefficients_size =
 7722-	    samp->num_contributors * samp->coefficient_width * sizeof(float) +
 7723-	    sizeof(float) *
 7724-	        STBIR_INPUT_CALLBACK_PADDING; // extra sizeof(float) is padding
 7725-
 7726-	samp->gather_prescatter_contributors = 0;
 7727-	samp->gather_prescatter_coefficients = 0;
 7728-	if (samp->is_gather == 0) {
 7729-		samp->gather_prescatter_coefficient_width = samp->filter_pixel_width;
 7730-		samp->gather_prescatter_num_contributors =
 7731-		    stbir__get_contributors(samp, 2);
 7732-		samp->gather_prescatter_contributors_size =
 7733-		    samp->gather_prescatter_num_contributors *
 7734-		    sizeof(stbir__contributors);
 7735-		samp->gather_prescatter_coefficients_size =
 7736-		    samp->gather_prescatter_num_contributors *
 7737-		    samp->gather_prescatter_coefficient_width * sizeof(float);
 7738-	}
 7739-}
 7740-
 7741-static void
 7742-stbir__get_conservative_extents(stbir__sampler *samp,
 7743-                                stbir__contributors *range, void *user_data)
 7744-{
 7745-	float scale = samp->scale_info.scale;
 7746-	float out_shift = samp->scale_info.pixel_shift;
 7747-	stbir__support_callback *support = samp->filter_support;
 7748-	int input_full_size = samp->scale_info.input_full_size;
 7749-	stbir_edge edge = samp->edge;
 7750-	float inv_scale = samp->scale_info.inv_scale;
 7751-
 7752-	STBIR_ASSERT(samp->is_gather != 0);
 7753-
 7754-	if (samp->is_gather == 1) {
 7755-		int in_first_pixel, in_last_pixel;
 7756-		float out_filter_radius = support(inv_scale, user_data) * scale;
 7757-
 7758-		stbir__calculate_in_pixel_range(&in_first_pixel, &in_last_pixel, 0.5,
 7759-		                                out_filter_radius, inv_scale, out_shift,
 7760-		                                input_full_size, edge);
 7761-		range->n0 = in_first_pixel;
 7762-		stbir__calculate_in_pixel_range(
 7763-		    &in_first_pixel, &in_last_pixel,
 7764-		    ((float)(samp->scale_info.output_sub_size - 1)) + 0.5f,
 7765-		    out_filter_radius, inv_scale, out_shift, input_full_size, edge);
 7766-		range->n1 = in_last_pixel;
 7767-	} else if (samp->is_gather == 2) // downsample gather, refine
 7768-	{
 7769-		float in_pixels_radius = support(scale, user_data) * inv_scale;
 7770-		int filter_pixel_margin = samp->filter_pixel_margin;
 7771-		int output_sub_size = samp->scale_info.output_sub_size;
 7772-		int input_end;
 7773-		int n;
 7774-		int in_first_pixel, in_last_pixel;
 7775-
 7776-		// get a conservative area of the input range
 7777-		stbir__calculate_in_pixel_range(&in_first_pixel, &in_last_pixel, 0, 0,
 7778-		                                inv_scale, out_shift, input_full_size,
 7779-		                                edge);
 7780-		range->n0 = in_first_pixel;
 7781-		stbir__calculate_in_pixel_range(&in_first_pixel, &in_last_pixel,
 7782-		                                (float)output_sub_size, 0, inv_scale,
 7783-		                                out_shift, input_full_size, edge);
 7784-		range->n1 = in_last_pixel;
 7785-
 7786-		// now go through the margin to the start of area to find bottom
 7787-		n = range->n0 + 1;
 7788-		input_end = -filter_pixel_margin;
 7789-		while (n >= input_end) {
 7790-			int out_first_pixel, out_last_pixel;
 7791-			stbir__calculate_out_pixel_range(
 7792-			    &out_first_pixel, &out_last_pixel, ((float)n) + 0.5f,
 7793-			    in_pixels_radius, scale, out_shift, output_sub_size);
 7794-			if (out_first_pixel > out_last_pixel) {
 7795-				break;
 7796-			}
 7797-
 7798-			if ((out_first_pixel < output_sub_size) || (out_last_pixel >= 0)) {
 7799-				range->n0 = n;
 7800-			}
 7801-			--n;
 7802-		}
 7803-
 7804-		// now go through the end of the area through the margin to find top
 7805-		n = range->n1 - 1;
 7806-		input_end = n + 1 + filter_pixel_margin;
 7807-		while (n <= input_end) {
 7808-			int out_first_pixel, out_last_pixel;
 7809-			stbir__calculate_out_pixel_range(
 7810-			    &out_first_pixel, &out_last_pixel, ((float)n) + 0.5f,
 7811-			    in_pixels_radius, scale, out_shift, output_sub_size);
 7812-			if (out_first_pixel > out_last_pixel) {
 7813-				break;
 7814-			}
 7815-			if ((out_first_pixel < output_sub_size) || (out_last_pixel >= 0)) {
 7816-				range->n1 = n;
 7817-			}
 7818-			++n;
 7819-		}
 7820-	}
 7821-
 7822-	if (samp->edge == STBIR_EDGE_WRAP) {
 7823-		// if we are wrapping, and we are very close to the image size (so the
 7824-		// edges might merge), just use the scanline up to the edge
 7825-		if ((range->n0 > 0) && (range->n1 >= input_full_size)) {
 7826-			int marg = range->n1 - input_full_size + 1;
 7827-			if ((marg + STBIR__MERGE_RUNS_PIXEL_THRESHOLD) >= range->n0) {
 7828-				range->n0 = 0;
 7829-			}
 7830-		}
 7831-		if ((range->n0 < 0) && (range->n1 < (input_full_size - 1))) {
 7832-			int marg = -range->n0;
 7833-			if ((input_full_size - marg - STBIR__MERGE_RUNS_PIXEL_THRESHOLD -
 7834-			     1) <= range->n1) {
 7835-				range->n1 = input_full_size - 1;
 7836-			}
 7837-		}
 7838-	} else {
 7839-		// for non-edge-wrap modes, we never read over the edge, so clamp
 7840-		if (range->n0 < 0) {
 7841-			range->n0 = 0;
 7842-		}
 7843-		if (range->n1 >= input_full_size) {
 7844-			range->n1 = input_full_size - 1;
 7845-		}
 7846-	}
 7847-}
 7848-
 7849-static void
 7850-stbir__get_split_info(stbir__per_split_info *split_info, int splits,
 7851-                      int output_height, int vertical_pixel_margin,
 7852-                      int input_full_height, int is_gather,
 7853-                      stbir__contributors *contribs)
 7854-{
 7855-	int i, cur;
 7856-	int left = output_height;
 7857-
 7858-	cur = 0;
 7859-	for (i = 0; i < splits; i++) {
 7860-		int each;
 7861-
 7862-		split_info[i].start_output_y = cur;
 7863-		each = left / (splits - i);
 7864-		split_info[i].end_output_y = cur + each;
 7865-
 7866-		// ok, when we are gathering, we need to make sure we are starting on a
 7867-		// y offset that doesn't have
 7868-		//   a "special" set of coefficients. Basically, with exactly the right
 7869-		//   filter at exactly the right resize at exactly the right phase, some
 7870-		//   of the coefficients can be zero. When they are zero, we don't
 7871-		//   process them at all.  But this leads to a tricky thing with the
 7872-		//   thread splits, where we might have a set of two coeffs like this
 7873-		//   for example: (4,4) and (3,6).  The 4,4 means there was just one
 7874-		//   single coeff because things worked out perfectly (normally, they
 7875-		//   all have 4 coeffs, like the range 3,6). The problem is that if we
 7876-		//   start right on the (4,4) on a brand new thread, then when we get to
 7877-		//   (3,6), we don't have the "3" sample in memory, because we didn't
 7878-		//   load it on the initial (4,4) range, which didn't have a 3 (we
 7879-		//   only add new samples that are larger than our existing samples -
 7880-		//   it's just how the eviction works). So, our solution here is pretty
 7881-		//   simple, if we start right on a range that has samples that start
 7882-		//   earlier, then we simply bump up our previous thread split range to
 7883-		//   include it, and then start this thread's range with the smaller
 7884-		//   sample. It just moves one scanline from one thread split to
 7885-		//   another, so that we end with the unusual one, instead of start with
 7886-		//   it. To do this, we check 2-4 samples at each thread split start and
 7887-		//   then occasionally move them.
 7888-
 7889-		if ((is_gather) && (i)) {
 7890-			stbir__contributors *small_contribs;
 7891-			int j, smallest, stop, start_n0;
 7892-			stbir__contributors *split_contribs = contribs + cur;
 7893-
 7894-			// scan for a max of 3x the filter width or until the next thread
 7895-			// split
 7896-			stop = vertical_pixel_margin * 3;
 7897-			if (each < stop) {
 7898-				stop = each;
 7899-			}
 7900-
 7901-			// loops a few times before early out
 7902-			smallest = 0;
 7903-			small_contribs = split_contribs;
 7904-			start_n0 = small_contribs->n0;
 7905-			for (j = 1; j <= stop; j++) {
 7906-				++split_contribs;
 7907-				if (split_contribs->n0 > start_n0) {
 7908-					break;
 7909-				}
 7910-				if (split_contribs->n0 < small_contribs->n0) {
 7911-					small_contribs = split_contribs;
 7912-					smallest = j;
 7913-				}
 7914-			}
 7915-
 7916-			split_info[i - 1].end_output_y += smallest;
 7917-			split_info[i].start_output_y += smallest;
 7918-		}
 7919-
 7920-		cur += each;
 7921-		left -= each;
 7922-
 7923-		// scatter range (updated to minimum as you run it)
 7924-		split_info[i].start_input_y = -vertical_pixel_margin;
 7925-		split_info[i].end_input_y = input_full_height + vertical_pixel_margin;
 7926-	}
 7927-}
 7928-
 7929-static void
 7930-stbir__free_internal_mem(stbir__info *info)
 7931-{
 7932-#define STBIR__FREE_AND_CLEAR(ptr)                                             \
 7933-	{                                                                          \
 7934-		if (ptr) {                                                             \
 7935-			void *p = (ptr);                                                   \
 7936-			(ptr) = 0;                                                         \
 7937-			STBIR_FREE(p, info->user_data);                                    \
 7938-		}                                                                      \
 7939-	}
 7940-
 7941-	if (info) {
 7942-#ifndef STBIR__SEPARATE_ALLOCATIONS
 7943-		STBIR__FREE_AND_CLEAR(info->alloced_mem);
 7944-#else
 7945-		int i, j;
 7946-
 7947-		if ((info->vertical.gather_prescatter_contributors) &&
 7948-		    ((void *)info->vertical.gather_prescatter_contributors !=
 7949-		     (void *)info->split_info[0].decode_buffer)) {
 7950-			STBIR__FREE_AND_CLEAR(
 7951-			    info->vertical.gather_prescatter_coefficients);
 7952-			STBIR__FREE_AND_CLEAR(
 7953-			    info->vertical.gather_prescatter_contributors);
 7954-		}
 7955-		for (i = 0; i < info->splits; i++) {
 7956-			for (j = 0; j < info->alloc_ring_buffer_num_entries; j++) {
 7957-#ifdef STBIR_SIMD8
 7958-				if (info->effective_channels == 3) {
 7959-					--info->split_info[i]
 7960-					      .ring_buffers[j]; // avx in 3 channel mode needs one
 7961-					                        // float at the start of the buffer
 7962-				}
 7963-#endif
 7964-				STBIR__FREE_AND_CLEAR(info->split_info[i].ring_buffers[j]);
 7965-			}
 7966-
 7967-#ifdef STBIR_SIMD8
 7968-			if (info->effective_channels == 3) {
 7969-				--info->split_info[i]
 7970-				      .decode_buffer; // avx in 3 channel mode needs one float
 7971-				                      // at the start of the buffer
 7972-			}
 7973-#endif
 7974-			STBIR__FREE_AND_CLEAR(info->split_info[i].decode_buffer);
 7975-			STBIR__FREE_AND_CLEAR(info->split_info[i].ring_buffers);
 7976-			STBIR__FREE_AND_CLEAR(info->split_info[i].vertical_buffer);
 7977-		}
 7978-		STBIR__FREE_AND_CLEAR(info->split_info);
 7979-		if (info->vertical.coefficients != info->horizontal.coefficients) {
 7980-			STBIR__FREE_AND_CLEAR(info->vertical.coefficients);
 7981-			STBIR__FREE_AND_CLEAR(info->vertical.contributors);
 7982-		}
 7983-		STBIR__FREE_AND_CLEAR(info->horizontal.coefficients);
 7984-		STBIR__FREE_AND_CLEAR(info->horizontal.contributors);
 7985-		STBIR__FREE_AND_CLEAR(info->alloced_mem);
 7986-		STBIR_FREE(info, info->user_data);
 7987-#endif
 7988-	}
 7989-
 7990-#undef STBIR__FREE_AND_CLEAR
 7991-}
 7992-
 7993-static int
 7994-stbir__get_max_split(int splits, int height)
 7995-{
 7996-	int i;
 7997-	int max = 0;
 7998-
 7999-	for (i = 0; i < splits; i++) {
 8000-		int each = height / (splits - i);
 8001-		if (each > max) {
 8002-			max = each;
 8003-		}
 8004-		height -= each;
 8005-	}
 8006-	return max;
 8007-}
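The splitting arithmetic in stbir__get_max_split can be tried in isolation. The block below is a minimal sketch, not part of the removed file: the name max_split_sketch and the precondition assert are assumptions added for illustration. It shows that dividing the remaining height by the remaining number of splits pushes any remainder toward the later splits, so the largest split is the ceiling of height / splits.

```c
#include <assert.h>

/* Hypothetical standalone copy of the loop in stbir__get_max_split.
   Each split takes (remaining height) / (remaining splits), so the
   rounding remainder accumulates in the later splits. */
static int max_split_sketch(int splits, int height)
{
	int i;
	int max = 0;

	assert(splits > 0); /* assumption: callers always pass at least one split */

	for (i = 0; i < splits; i++) {
		int each = height / (splits - i); /* size of this split */
		if (each > max) {
			max = each;
		}
		height -= each; /* rows left for the remaining splits */
	}
	return max;
}
```

For 10 rows over 3 splits the sizes work out to 3, 3 and 4, so the conservative per-split size is 4.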
 8008-
 8009-static stbir__horizontal_gather_channels_func *
 8010-    *stbir__horizontal_gather_n_coeffs_funcs[8] = {
 8011-        0,
 8012-        stbir__horizontal_gather_1_channels_with_n_coeffs_funcs,
 8013-        stbir__horizontal_gather_2_channels_with_n_coeffs_funcs,
 8014-        stbir__horizontal_gather_3_channels_with_n_coeffs_funcs,
 8015-        stbir__horizontal_gather_4_channels_with_n_coeffs_funcs,
 8016-        0,
 8017-        0,
 8018-        stbir__horizontal_gather_7_channels_with_n_coeffs_funcs};
 8019-
 8020-static stbir__horizontal_gather_channels_func *
 8021-    *stbir__horizontal_gather_channels_funcs[8] = {
 8022-        0,
 8023-        stbir__horizontal_gather_1_channels_funcs,
 8024-        stbir__horizontal_gather_2_channels_funcs,
 8025-        stbir__horizontal_gather_3_channels_funcs,
 8026-        stbir__horizontal_gather_4_channels_funcs,
 8027-        0,
 8028-        0,
 8029-        stbir__horizontal_gather_7_channels_funcs};
 8030-
 8031-// the classifier below yields these resize classes: 0 == vertical scatter,
 8032-// 1 == vertical gather <= 1x scale, 2 == gather <= 2x, 3 == gather <= 3x,
 8033-// 5 == gather <= 4x, 6 == gather > 4x (or <= 4 pixel output, height < width),
 8034-// 7 == other <= 4 pixel output; 4 is never produced, so the table has 8 rows
 8035-#define STBIR_RESIZE_CLASSIFICATIONS 8
 8036-
 8037-static float stbir__compute_weights[5][STBIR_RESIZE_CLASSIFICATIONS]
 8038-                                   [4] = // 5 = 0=1chan, 1=2chan, 2=3chan,
 8039-                                         // 3=4chan, 4=7chan
 8040-    {{
 8041-         {1.00000f, 1.00000f, 0.31250f, 1.00000f},
 8042-         {0.56250f, 0.59375f, 0.00000f, 0.96875f},
 8043-         {1.00000f, 0.06250f, 0.00000f, 1.00000f},
 8044-         {0.00000f, 0.09375f, 1.00000f, 1.00000f},
 8045-         {1.00000f, 1.00000f, 1.00000f, 1.00000f},
 8046-         {0.03125f, 0.12500f, 1.00000f, 1.00000f},
 8047-         {0.06250f, 0.12500f, 0.00000f, 1.00000f},
 8048-         {0.00000f, 1.00000f, 0.00000f, 0.03125f},
 8049-     },
 8050-     {
 8051-         {0.00000f, 0.84375f, 0.00000f, 0.03125f},
 8052-         {0.09375f, 0.93750f, 0.00000f, 0.78125f},
 8053-         {0.87500f, 0.21875f, 0.00000f, 0.96875f},
 8054-         {0.09375f, 0.09375f, 1.00000f, 1.00000f},
 8055-         {1.00000f, 1.00000f, 1.00000f, 1.00000f},
 8056-         {0.03125f, 0.12500f, 1.00000f, 1.00000f},
 8057-         {0.06250f, 0.12500f, 0.00000f, 1.00000f},
 8058-         {0.00000f, 1.00000f, 0.00000f, 0.53125f},
 8059-     },
 8060-     {
 8061-         {0.00000f, 0.53125f, 0.00000f, 0.03125f},
 8062-         {0.06250f, 0.96875f, 0.00000f, 0.53125f},
 8063-         {0.87500f, 0.18750f, 0.00000f, 0.93750f},
 8064-         {0.00000f, 0.09375f, 1.00000f, 1.00000f},
 8065-         {1.00000f, 1.00000f, 1.00000f, 1.00000f},
 8066-         {0.03125f, 0.12500f, 1.00000f, 1.00000f},
 8067-         {0.06250f, 0.12500f, 0.00000f, 1.00000f},
 8068-         {0.00000f, 1.00000f, 0.00000f, 0.56250f},
 8069-     },
 8070-     {
 8071-         {0.00000f, 0.50000f, 0.00000f, 0.71875f},
 8072-         {0.06250f, 0.84375f, 0.00000f, 0.87500f},
 8073-         {1.00000f, 0.50000f, 0.50000f, 0.96875f},
 8074-         {1.00000f, 0.09375f, 0.31250f, 0.50000f},
 8075-         {1.00000f, 1.00000f, 1.00000f, 1.00000f},
 8076-         {1.00000f, 0.03125f, 0.03125f, 0.53125f},
 8077-         {0.18750f, 0.12500f, 0.00000f, 1.00000f},
 8078-         {0.00000f, 1.00000f, 0.03125f, 0.18750f},
 8079-     },
 8080-     {
 8081-         {0.00000f, 0.59375f, 0.00000f, 0.96875f},
 8082-         {0.06250f, 0.81250f, 0.06250f, 0.59375f},
 8083-         {0.75000f, 0.43750f, 0.12500f, 0.96875f},
 8084-         {0.87500f, 0.06250f, 0.18750f, 0.43750f},
 8085-         {1.00000f, 1.00000f, 1.00000f, 1.00000f},
 8086-         {0.15625f, 0.12500f, 1.00000f, 1.00000f},
 8087-         {0.06250f, 0.12500f, 0.00000f, 1.00000f},
 8088-         {0.00000f, 1.00000f, 0.03125f, 0.34375f},
 8089-     }};
 8090-
 8091-// structure that allows us to query and override info for training the costs
 8092-typedef struct STBIR__V_FIRST_INFO {
 8093-	double v_cost, h_cost;
 8094-	int control_v_first; // 0 = no control, 1 = force hori, 2 = force vert
 8095-	int v_first;
 8096-	int v_resize_classification;
 8097-	int is_gather;
 8098-} STBIR__V_FIRST_INFO;
 8099-
 8100-#ifdef STBIR__V_FIRST_INFO_BUFFER
 8101-static STBIR__V_FIRST_INFO STBIR__V_FIRST_INFO_BUFFER = {0};
 8102-#define STBIR__V_FIRST_INFO_POINTER &STBIR__V_FIRST_INFO_BUFFER
 8103-#else
 8104-#define STBIR__V_FIRST_INFO_POINTER 0
 8105-#endif
 8106-
 8107-// Figure out whether to scale along the horizontal or vertical first.
 8108-//   This is only *super* important when you are scaling by a massively
 8109-//   different amount in the vertical vs the horizontal (for example, if
 8110-//   you are scaling by 2x in the width, and 0.5x in the height, then you
 8111-//   want to do the vertical scale first, because it's around 3x faster
 8112-//   in that order).
 8113-//
 8114-//   In more normal circumstances, this makes a 20-40% difference, so
 8115-//     it's good to get right, but not critical. The normal way that you
 8116-//     decide which direction goes first is just figuring out which
 8117-//     direction does more multiplies. But with modern CPUs with their
 8118-//     fancy caches and SIMD and high IPC abilities, there's just a lot
 8119-//     more that goes into it.
 8120-//
 8121-//   My handwavy sort of solution is to have an app that does a whole
 8122-//     bunch of timing for both vertical and horizontal first modes,
 8123-//     and then another app that can read lots of these timing files
 8124-//     and try to search for the best weights to use. Dotimings.c
 8125-//     is the app that does a bunch of timings, and vf_train.c is the
 8126-//     app that solves for the best weights (and shows how well it
 8127-//     does currently).
 8128-
 8129-static int
 8130-stbir__should_do_vertical_first(
 8131-    float weights_table[STBIR_RESIZE_CLASSIFICATIONS][4],
 8132-    int horizontal_filter_pixel_width, float horizontal_scale,
 8133-    int horizontal_output_size, int vertical_filter_pixel_width,
 8134-    float vertical_scale, int vertical_output_size, int is_gather,
 8135-    STBIR__V_FIRST_INFO *info)
 8136-{
 8137-	double v_cost, h_cost;
 8138-	float *weights;
 8139-	int vertical_first;
 8140-	int v_classification;
 8141-
 8142-	// categorize the resize into buckets
 8143-	if ((vertical_output_size <= 4) || (horizontal_output_size <= 4)) {
 8144-		v_classification =
 8145-		    (vertical_output_size < horizontal_output_size) ? 6 : 7;
 8146-	} else if (vertical_scale <= 1.0f) {
 8147-		v_classification = (is_gather) ? 1 : 0;
 8148-	} else if (vertical_scale <= 2.0f) {
 8149-		v_classification = 2;
 8150-	} else if (vertical_scale <= 3.0f) {
 8151-		v_classification = 3;
 8152-	} else if (vertical_scale <= 4.0f) {
 8153-		v_classification = 5;
 8154-	} else {
 8155-		v_classification = 6;
 8156-	}
 8157-
 8158-	// use the right weights
 8159-	weights = weights_table[v_classification];
 8160-
 8161-	// these are the costs when you don't take into account modern CPUs with high
 8162-	// ipc and simd and caches - wish we had a better estimate
 8163-	h_cost = (float)horizontal_filter_pixel_width * weights[0] +
 8164-	         horizontal_scale * (float)vertical_filter_pixel_width * weights[1];
 8165-	v_cost = (float)vertical_filter_pixel_width * weights[2] +
 8166-	         vertical_scale * (float)horizontal_filter_pixel_width * weights[3];
 8167-
 8168-	// use computation estimate to decide vertical first or not
 8169-	vertical_first = (v_cost <= h_cost) ? 1 : 0;
 8170-
 8171-	// save these, if requested
 8172-	if (info) {
 8173-		info->h_cost = h_cost;
 8174-		info->v_cost = v_cost;
 8175-		info->v_resize_classification = v_classification;
 8176-		info->v_first = vertical_first;
 8177-		info->is_gather = is_gather;
 8178-	}
 8179-
 8180-	// and this allows us to override everything for testing (see dotiming.c)
 8181-	if ((info) && (info->control_v_first)) {
 8182-		vertical_first = (info->control_v_first == 2) ? 1 : 0;
 8183-	}
 8184-
 8185-	return vertical_first;
 8186-}
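The cost comparison at the heart of stbir__should_do_vertical_first can be exercised on its own. The block below is a minimal sketch under stated assumptions: vertical_first_sketch is a hypothetical helper, it takes one row of four weights directly instead of classifying the resize, and it omits the override hook. It only reproduces the two cost formulas and the final comparison shown above.

```c
#include <assert.h>

/* Hypothetical helper mirroring the h_cost / v_cost formulas above.
   w points at one row of four weights: w[0], w[1] scale the
   horizontal-first cost and w[2], w[3] the vertical-first cost. */
static int vertical_first_sketch(const float *w, int h_filter_width,
                                 float h_scale, int v_filter_width,
                                 float v_scale)
{
	double h_cost, v_cost;

	assert(w != 0); /* assumption: a weights row is always supplied */

	h_cost = (float)h_filter_width * w[0] +
	         h_scale * (float)v_filter_width * w[1];
	v_cost = (float)v_filter_width * w[2] +
	         v_scale * (float)h_filter_width * w[3];

	/* ties go to vertical first, as in the original comparison */
	return (v_cost <= h_cost) ? 1 : 0;
}
```

With all four weights set to 1 and both filter widths at 4, a 2x horizontal and 0.5x vertical scale gives h_cost 12 against v_cost 6, so the vertical pass goes first, which agrees with the example in the comment block before the function.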
 8187-
 8188-// layout lookups - must match stbir_internal_pixel_layout
 8189-static unsigned char stbir__pixel_channels[] = {
 8190-    1, 2, 3, 3, 4,    // 1ch, 2ch, rgb, bgr, 4ch
 8191-    4, 4, 4, 4, 2, 2, // RGBA,BGRA,ARGB,ABGR,RA,AR
 8192-    4, 4, 4, 4, 2, 2, // RGBA_PM,BGRA_PM,ARGB_PM,ABGR_PM,RA_PM,AR_PM
 8193-};
 8194-
 8195-// the internal pixel layout enums are in a different order, so we can easily do
 8196-// range comparisons of types
 8197-//   the public pixel layout is ordered in a way that if you cast num_channels
 8198-//   (1-4) to the enum, you get something sensible
 8199-static stbir_internal_pixel_layout
 8200-    stbir__pixel_layout_convert_public_to_internal[] = {
 8201-        STBIRI_BGR,     STBIRI_1CHANNEL, STBIRI_2CHANNEL, STBIRI_RGB,
 8202-        STBIRI_RGBA,    STBIRI_4CHANNEL, STBIRI_BGRA,     STBIRI_ARGB,
 8203-        STBIRI_ABGR,    STBIRI_RA,       STBIRI_AR,       STBIRI_RGBA_PM,
 8204-        STBIRI_BGRA_PM, STBIRI_ARGB_PM,  STBIRI_ABGR_PM,  STBIRI_RA_PM,
 8205-        STBIRI_AR_PM,
 8206-};
 8207-
 8208-static stbir__info *
 8209-stbir__alloc_internal_mem_and_build_samplers(
 8210-    stbir__sampler *horizontal, stbir__sampler *vertical,
 8211-    stbir__contributors *conservative,
 8212-    stbir_pixel_layout input_pixel_layout_public,
 8213-    stbir_pixel_layout output_pixel_layout_public, int splits, int new_x,
 8214-    int new_y, int fast_alpha,
 8215-    void *user_data STBIR_ONLY_PROFILE_BUILD_GET_INFO)
 8216-{
 8217-	static char stbir_channel_count_index[8] = {9, 0, 1, 2, 3, 9, 9, 4};
 8218-
 8219-	stbir__info *info = 0;
 8220-	void *alloced = 0;
 8221-	size_t alloced_total = 0;
 8222-	int vertical_first;
 8223-	size_t decode_buffer_size, ring_buffer_length_bytes, ring_buffer_size,
 8224-	    vertical_buffer_size;
 8225-	int alloc_ring_buffer_num_entries;
 8226-
 8227-	int alpha_weighting_type = 0; // 0=none, 1=simple, 2=fancy
 8228-	int conservative_split_output_size =
 8229-	    stbir__get_max_split(splits, vertical->scale_info.output_sub_size);
 8230-	stbir_internal_pixel_layout input_pixel_layout =
 8231-	    stbir__pixel_layout_convert_public_to_internal
 8232-	        [input_pixel_layout_public];
 8233-	stbir_internal_pixel_layout output_pixel_layout =
 8234-	    stbir__pixel_layout_convert_public_to_internal
 8235-	        [output_pixel_layout_public];
 8236-	int channels = stbir__pixel_channels[input_pixel_layout];
 8237-	int effective_channels = channels;
 8238-
 8239-	// first figure out what type of alpha weighting to use (if any)
 8240-	if ((horizontal->filter_enum != STBIR_FILTER_POINT_SAMPLE) ||
 8241-	    (vertical->filter_enum !=
 8242-	     STBIR_FILTER_POINT_SAMPLE)) // no alpha weighting on point sampling
 8243-	{
 8244-		if ((input_pixel_layout >= STBIRI_RGBA) &&
 8245-		    (input_pixel_layout <= STBIRI_AR) &&
 8246-		    (output_pixel_layout >= STBIRI_RGBA) &&
 8247-		    (output_pixel_layout <= STBIRI_AR)) {
 8248-			if (fast_alpha) {
 8249-				alpha_weighting_type = 4;
 8250-			} else {
 8251-				static int fancy_alpha_effective_cnts[6] = {7, 7, 7, 7, 3, 3};
 8252-				alpha_weighting_type = 2;
 8253-				effective_channels =
 8254-				    fancy_alpha_effective_cnts[input_pixel_layout -
 8255-				                               STBIRI_RGBA];
 8256-			}
 8257-		} else if ((input_pixel_layout >= STBIRI_RGBA_PM) &&
 8258-		           (input_pixel_layout <= STBIRI_AR_PM) &&
 8259-		           (output_pixel_layout >= STBIRI_RGBA) &&
 8260-		           (output_pixel_layout <= STBIRI_AR)) {
 8261-			// input premult, output non-premult
 8262-			alpha_weighting_type = 3;
 8263-		} else if ((input_pixel_layout >= STBIRI_RGBA) &&
 8264-		           (input_pixel_layout <= STBIRI_AR) &&
 8265-		           (output_pixel_layout >= STBIRI_RGBA_PM) &&
 8266-		           (output_pixel_layout <= STBIRI_AR_PM)) {
 8267-			// input non-premult, output premult
 8268-			alpha_weighting_type = 1;
 8269-		}
 8270-	}
 8271-
 8272-	// channel in and out count must match currently
 8273-	if (channels != stbir__pixel_channels[output_pixel_layout]) {
 8274-		return 0;
 8275-	}
 8276-
 8277-	// get vertical first
 8278-	vertical_first = stbir__should_do_vertical_first(
 8279-	    stbir__compute_weights[(
 8280-	        int)stbir_channel_count_index[effective_channels]],
 8281-	    horizontal->filter_pixel_width, horizontal->scale_info.scale,
 8282-	    horizontal->scale_info.output_sub_size, vertical->filter_pixel_width,
 8283-	    vertical->scale_info.scale, vertical->scale_info.output_sub_size,
 8284-	    vertical->is_gather, STBIR__V_FIRST_INFO_POINTER);
 8285-
 8286-	// sometimes read one float off in some of the unrolled loops (with a weight
 8287-	// of zero coeff, so it doesn't have an effect)
 8288-	//   we use a few extra floats instead of just 1, so that input callback
 8289-	//   buffer can overlap with the decode buffer without the conversion
 8290-	//   routines overwriting the callback input data.
 8291-	decode_buffer_size =
 8292-	    (conservative->n1 - conservative->n0 + 1) * effective_channels *
 8293-	        sizeof(float) +
 8294-	    sizeof(float) * STBIR_INPUT_CALLBACK_PADDING; // extra floats for input
 8295-	                                                  // callback stagger
 8296-
 8297-#if defined(STBIR__SEPARATE_ALLOCATIONS) && defined(STBIR_SIMD8)
 8298-	if (effective_channels == 3) {
 8299-		decode_buffer_size +=
 8300-		    sizeof(float); // avx in 3 channel mode needs one float at the start
 8301-		                   // of the buffer (only with separate allocations)
 8302-	}
 8303-#endif
 8304-
 8305-	ring_buffer_length_bytes =
 8306-	    (size_t)horizontal->scale_info.output_sub_size *
 8307-	        (size_t)effective_channels * sizeof(float) +
 8308-	    sizeof(float) *
 8309-	        STBIR_INPUT_CALLBACK_PADDING; // extra floats for padding
 8310-
 8311-	// if we do vertical first, the ring buffer holds a whole decoded line
 8312-	if (vertical_first) {
 8313-		ring_buffer_length_bytes = (decode_buffer_size + 15) & ~15;
 8314-	}
 8315-
 8316-	if ((ring_buffer_length_bytes & 4095) == 0) {
 8317-		ring_buffer_length_bytes += 64 * 3; // avoid 4k alias
 8318-	}
 8319-
 8320-	// One extra entry because floating point precision problems sometimes cause
 8321-	// an extra to be necessary.
 8322-	alloc_ring_buffer_num_entries = vertical->filter_pixel_width + 1;
 8323-
 8324-	// we never need more ring buffer entries than the scanlines we're
 8325-	// outputting when in scatter mode
 8326-	if ((!vertical->is_gather) &&
 8327-	    (alloc_ring_buffer_num_entries > conservative_split_output_size)) {
 8328-		alloc_ring_buffer_num_entries = conservative_split_output_size;
 8329-	}
 8330-
 8331-	ring_buffer_size = (size_t)alloc_ring_buffer_num_entries *
 8332-	                   (size_t)ring_buffer_length_bytes;
 8333-
 8334-	// The vertical buffer is used differently, depending on whether we are
 8335-	// scattering
 8336-	//   the vertical scanlines, or gathering them.
 8337-	//   If scattering, it's used as the temp buffer to accumulate each output.
 8338-	//   If gathering, it's just the output buffer.
 8339-	vertical_buffer_size = (size_t)horizontal->scale_info.output_sub_size *
 8340-	                           (size_t)effective_channels * sizeof(float) +
 8341-	                       sizeof(float); // extra float for padding
 8342-
 8343-	// we make two passes through this loop, 1st to add everything up, 2nd to
 8344-	// allocate and init
 8345-	for (;;) {
 8346-		int i;
 8347-		void *advance_mem = alloced;
 8348-		int copy_horizontal = 0;
 8349-		stbir__sampler *possibly_use_horizontal_for_pivot = 0;
 8350-
 8351-#ifdef STBIR__SEPARATE_ALLOCATIONS
 8352-#define STBIR__NEXT_PTR(ptr, size, ntype)                                      \
 8353-	if (alloced) {                                                             \
 8354-		void *p = STBIR_MALLOC(size, user_data);                               \
 8355-		if (p == 0) {                                                          \
 8356-			stbir__free_internal_mem(info);                                    \
 8357-			return 0;                                                          \
 8358-		}                                                                      \
 8359-		(ptr) = (ntype *)p;                                                    \
 8360-	}
 8361-#else
 8362-#define STBIR__NEXT_PTR(ptr, size, ntype)                                      \
 8363-	advance_mem = (void *)((((size_t)advance_mem) + 15) & ~15);                \
 8364-	if (alloced)                                                               \
 8365-		ptr = (ntype *)advance_mem;                                            \
 8366-	advance_mem = (char *)(((size_t)advance_mem) + (size));
 8367-#endif
 8368-
 8369-		STBIR__NEXT_PTR(info, sizeof(stbir__info), stbir__info);
 8370-
 8371-		STBIR__NEXT_PTR(info->split_info,
 8372-		                sizeof(stbir__per_split_info) * splits,
 8373-		                stbir__per_split_info);
 8374-
 8375-		if (info) {
 8376-			static stbir__alpha_weight_func *fancy_alpha_weights[6] = {
 8377-			    stbir__fancy_alpha_weight_4ch, stbir__fancy_alpha_weight_4ch,
 8378-			    stbir__fancy_alpha_weight_4ch, stbir__fancy_alpha_weight_4ch,
 8379-			    stbir__fancy_alpha_weight_2ch, stbir__fancy_alpha_weight_2ch};
 8380-			static stbir__alpha_unweight_func *fancy_alpha_unweights[6] = {
 8381-			    stbir__fancy_alpha_unweight_4ch,
 8382-			    stbir__fancy_alpha_unweight_4ch,
 8383-			    stbir__fancy_alpha_unweight_4ch,
 8384-			    stbir__fancy_alpha_unweight_4ch,
 8385-			    stbir__fancy_alpha_unweight_2ch,
 8386-			    stbir__fancy_alpha_unweight_2ch};
 8387-			static stbir__alpha_weight_func *simple_alpha_weights[6] = {
 8388-			    stbir__simple_alpha_weight_4ch, stbir__simple_alpha_weight_4ch,
 8389-			    stbir__simple_alpha_weight_4ch, stbir__simple_alpha_weight_4ch,
 8390-			    stbir__simple_alpha_weight_2ch, stbir__simple_alpha_weight_2ch};
 8391-			static stbir__alpha_unweight_func *simple_alpha_unweights[6] = {
 8392-			    stbir__simple_alpha_unweight_4ch,
 8393-			    stbir__simple_alpha_unweight_4ch,
 8394-			    stbir__simple_alpha_unweight_4ch,
 8395-			    stbir__simple_alpha_unweight_4ch,
 8396-			    stbir__simple_alpha_unweight_2ch,
 8397-			    stbir__simple_alpha_unweight_2ch};
 8398-
 8399-			// initialize info fields
 8400-			info->alloced_mem = alloced;
 8401-			info->alloced_total = alloced_total;
 8402-
 8403-			info->channels = channels;
 8404-			info->effective_channels = effective_channels;
 8405-
 8406-			info->offset_x = new_x;
 8407-			info->offset_y = new_y;
 8408-			info->alloc_ring_buffer_num_entries =
 8409-			    (int)alloc_ring_buffer_num_entries;
 8410-			info->ring_buffer_num_entries = 0;
 8411-			info->ring_buffer_length_bytes = (int)ring_buffer_length_bytes;
 8412-			info->splits = splits;
 8413-			info->vertical_first = vertical_first;
 8414-
 8415-			info->input_pixel_layout_internal = input_pixel_layout;
 8416-			info->output_pixel_layout_internal = output_pixel_layout;
 8417-
 8418-			// setup alpha weight functions
 8419-			info->alpha_weight = 0;
 8420-			info->alpha_unweight = 0;
 8421-
 8422-			// handle alpha weighting functions and overrides
 8423-			if (alpha_weighting_type == 2) {
 8424-				// high quality alpha multiplying on the way in, dividing on the
 8425-				// way out
 8426-				info->alpha_weight =
 8427-				    fancy_alpha_weights[input_pixel_layout - STBIRI_RGBA];
 8428-				info->alpha_unweight =
 8429-				    fancy_alpha_unweights[output_pixel_layout - STBIRI_RGBA];
 8430-			} else if (alpha_weighting_type == 4) {
 8431-				// fast alpha multiplying on the way in, dividing on the way out
 8432-				info->alpha_weight =
 8433-				    simple_alpha_weights[input_pixel_layout - STBIRI_RGBA];
 8434-				info->alpha_unweight =
 8435-				    simple_alpha_unweights[output_pixel_layout - STBIRI_RGBA];
 8436-			} else if (alpha_weighting_type == 1) {
 8437-				// fast alpha on the way in, leave in premultiplied form on way
 8438-				// out
 8439-				info->alpha_weight =
 8440-				    simple_alpha_weights[input_pixel_layout - STBIRI_RGBA];
 8441-			} else if (alpha_weighting_type == 3) {
 8442-				// incoming is premultiplied, fast alpha dividing on the way out
 8443-				// - non-premultiplied output
 8444-				info->alpha_unweight =
 8445-				    simple_alpha_unweights[output_pixel_layout - STBIRI_RGBA];
 8446-			}
 8447-
 8448-			// handle 3-chan color flipping, using the alpha weight path
 8449-			if (((input_pixel_layout == STBIRI_RGB) &&
 8450-			     (output_pixel_layout == STBIRI_BGR)) ||
 8451-			    ((input_pixel_layout == STBIRI_BGR) &&
 8452-			     (output_pixel_layout == STBIRI_RGB))) {
 8453-				// do the flipping on the smaller of the two ends
 8454-				if (horizontal->scale_info.scale < 1.0f) {
 8455-					info->alpha_unweight = stbir__simple_flip_3ch;
 8456-				} else {
 8457-					info->alpha_weight = stbir__simple_flip_3ch;
 8458-				}
 8459-			}
 8460-		}
 8461-
 8462-		// get all the per-split buffers
 8463-		for (i = 0; i < splits; i++) {
 8464-			STBIR__NEXT_PTR(info->split_info[i].decode_buffer,
 8465-			                decode_buffer_size, float);
 8466-
 8467-#ifdef STBIR__SEPARATE_ALLOCATIONS
 8468-
 8469-#ifdef STBIR_SIMD8
 8470-			if ((info) && (effective_channels == 3)) {
 8471-				++info->split_info[i]
 8472-				      .decode_buffer; // avx in 3 channel mode needs one float
 8473-				                      // at the start of the buffer
 8474-			}
 8475-#endif
 8476-
 8477-			STBIR__NEXT_PTR(info->split_info[i].ring_buffers,
 8478-			                alloc_ring_buffer_num_entries * sizeof(float *),
 8479-			                float *);
 8480-			{
 8481-				int j;
 8482-				for (j = 0; j < alloc_ring_buffer_num_entries; j++) {
 8483-					STBIR__NEXT_PTR(info->split_info[i].ring_buffers[j],
 8484-					                ring_buffer_length_bytes, float);
 8485-#ifdef STBIR_SIMD8
 8486-					if ((info) && (effective_channels == 3)) {
 8487-						++info->split_info[i]
 8488-						      .ring_buffers[j]; // avx in 3 channel mode needs
 8489-						                        // one float at the start of the
 8490-						                        // buffer
 8491-					}
 8492-#endif
 8493-				}
 8494-			}
 8495-#else
 8496-			STBIR__NEXT_PTR(info->split_info[i].ring_buffer, ring_buffer_size,
 8497-			                float);
 8498-#endif
 8499-			STBIR__NEXT_PTR(info->split_info[i].vertical_buffer,
 8500-			                vertical_buffer_size, float);
 8501-		}
 8502-
 8503-		// alloc memory for to-be-pivoted coeffs (if necessary)
 8504-		if (vertical->is_gather == 0) {
 8505-			size_t both;
 8506-			size_t temp_mem_amt;
 8507-
 8508-			// when in vertical scatter mode, we first build the coefficients in
 8509-			// gather mode, and then pivot after,
 8510-			//   that means we need two buffers, so we try to use the decode
 8511-			//   buffer and ring buffer for this. if that is too small, we just
 8512-			//   allocate extra memory to use as this temp.
 8513-
 8514-			both = (size_t)vertical->gather_prescatter_contributors_size +
 8515-			       (size_t)vertical->gather_prescatter_coefficients_size;
 8516-
 8517-#ifdef STBIR__SEPARATE_ALLOCATIONS
 8518-			temp_mem_amt = decode_buffer_size;
 8519-
 8520-#ifdef STBIR_SIMD8
 8521-			if (effective_channels == 3) {
 8522-				--temp_mem_amt; // avx in 3 channel mode needs one float at the
 8523-				                // start of the buffer
 8524-			}
 8525-#endif
 8526-#else
 8527-			temp_mem_amt = (size_t)(decode_buffer_size + ring_buffer_size +
 8528-			                        vertical_buffer_size) *
 8529-			               (size_t)splits;
 8530-#endif
 8531-			if (temp_mem_amt >= both) {
 8532-				if (info) {
 8533-					vertical->gather_prescatter_contributors =
 8534-					    (stbir__contributors *)info->split_info[0]
 8535-					        .decode_buffer;
 8536-					vertical->gather_prescatter_coefficients =
 8537-					    (float *)(((char *)info->split_info[0].decode_buffer) +
 8538-					              vertical
 8539-					                  ->gather_prescatter_contributors_size);
 8540-				}
 8541-			} else {
 8542-				// ring+decode memory is too small, so allocate temp memory
 8543-				STBIR__NEXT_PTR(vertical->gather_prescatter_contributors,
 8544-				                vertical->gather_prescatter_contributors_size,
 8545-				                stbir__contributors);
 8546-				STBIR__NEXT_PTR(vertical->gather_prescatter_coefficients,
 8547-				                vertical->gather_prescatter_coefficients_size,
 8548-				                float);
 8549-			}
 8550-		}
 8551-
 8552-		STBIR__NEXT_PTR(horizontal->contributors, horizontal->contributors_size,
 8553-		                stbir__contributors);
 8554-		STBIR__NEXT_PTR(horizontal->coefficients, horizontal->coefficients_size,
 8555-		                float);
 8556-
 8557-		// are the two filters identical?? (happens a lot with mipmap
 8558-		// generation)
 8559-		if ((horizontal->filter_kernel == vertical->filter_kernel) &&
 8560-		    (horizontal->filter_support == vertical->filter_support) &&
 8561-		    (horizontal->edge == vertical->edge) &&
 8562-		    (horizontal->scale_info.output_sub_size ==
 8563-		     vertical->scale_info.output_sub_size)) {
 8564-			float diff_scale =
 8565-			    horizontal->scale_info.scale - vertical->scale_info.scale;
 8566-			float diff_shift = horizontal->scale_info.pixel_shift -
 8567-			                   vertical->scale_info.pixel_shift;
 8568-			if (diff_scale < 0.0f) {
 8569-				diff_scale = -diff_scale;
 8570-			}
 8571-			if (diff_shift < 0.0f) {
 8572-				diff_shift = -diff_shift;
 8573-			}
 8574-			if ((diff_scale <= stbir__small_float) &&
 8575-			    (diff_shift <= stbir__small_float)) {
 8576-				if (horizontal->is_gather == vertical->is_gather) {
 8577-					copy_horizontal = 1;
 8578-					goto no_vert_alloc;
 8579-				}
 8580-				// everything matches, but vertical is scatter, horizontal is
 8581-				// gather, use horizontal coeffs for vertical pivot coeffs
 8582-				possibly_use_horizontal_for_pivot = horizontal;
 8583-			}
 8584-		}
 8585-
 8586-		STBIR__NEXT_PTR(vertical->contributors, vertical->contributors_size,
 8587-		                stbir__contributors);
 8588-		STBIR__NEXT_PTR(vertical->coefficients, vertical->coefficients_size,
 8589-		                float);
 8590-
 8591-	no_vert_alloc:
 8592-
 8593-		if (info) {
 8594-			STBIR_PROFILE_BUILD_START(horizontal);
 8595-
 8596-			stbir__calculate_filters(
 8597-			    horizontal, 0, user_data STBIR_ONLY_PROFILE_BUILD_SET_INFO);
 8598-
 8599-			// setup the horizontal gather functions
 8600-			// start with defaulting to the n_coeffs functions (specialized on
 8601-			// channels and remnant leftover)
 8602-			info->horizontal_gather_channels =
 8603-			    stbir__horizontal_gather_n_coeffs_funcs
 8604-			        [effective_channels][horizontal->extent_info.widest & 3];
 8605-			// but if the number of coeffs <= 12, use another set of special
 8606-			// cases. <=12 coeffs is any enlarging resize, or shrinking resize
 8607-			// down to about 1/3 size
 8608-			if (horizontal->extent_info.widest <= 12) {
 8609-				info->horizontal_gather_channels =
 8610-				    stbir__horizontal_gather_channels_funcs
 8611-				        [effective_channels]
 8612-				        [horizontal->extent_info.widest - 1];
 8613-			}
 8614-
 8615-			info->scanline_extents.conservative.n0 = conservative->n0;
 8616-			info->scanline_extents.conservative.n1 = conservative->n1;
 8617-
 8618-			// get exact extents
 8619-			stbir__get_extents(horizontal, &info->scanline_extents);
 8620-
 8621-			// pack the horizontal coeffs
 8622-			horizontal->coefficient_width = stbir__pack_coefficients(
 8623-			    horizontal->num_contributors, horizontal->contributors,
 8624-			    horizontal->coefficients, horizontal->coefficient_width,
 8625-			    horizontal->extent_info.widest,
 8626-			    info->scanline_extents.conservative.n0,
 8627-			    info->scanline_extents.conservative.n1);
 8628-
 8629-			STBIR_MEMCPY(&info->horizontal, horizontal, sizeof(stbir__sampler));
 8630-
 8631-			STBIR_PROFILE_BUILD_END(horizontal);
 8632-
 8633-			if (copy_horizontal) {
 8634-				STBIR_MEMCPY(&info->vertical, horizontal,
 8635-				             sizeof(stbir__sampler));
 8636-			} else {
 8637-				STBIR_PROFILE_BUILD_START(vertical);
 8638-
 8639-				stbir__calculate_filters(
 8640-				    vertical, possibly_use_horizontal_for_pivot,
 8641-				    user_data STBIR_ONLY_PROFILE_BUILD_SET_INFO);
 8642-				STBIR_MEMCPY(&info->vertical, vertical, sizeof(stbir__sampler));
 8643-
 8644-				STBIR_PROFILE_BUILD_END(vertical);
 8645-			}
 8646-
 8647-			// setup the vertical split ranges
 8648-			stbir__get_split_info(info->split_info, info->splits,
 8649-			                      info->vertical.scale_info.output_sub_size,
 8650-			                      info->vertical.filter_pixel_margin,
 8651-			                      info->vertical.scale_info.input_full_size,
 8652-			                      info->vertical.is_gather,
 8653-			                      info->vertical.contributors);
 8654-
 8655-			// now we know precisely how many entries we need
 8656-			info->ring_buffer_num_entries = info->vertical.extent_info.widest;
 8657-
 8658-			// we never need more ring buffer entries than the scanlines we're
 8659-			// outputting
 8660-			if ((!info->vertical.is_gather) &&
 8661-			    (info->ring_buffer_num_entries >
 8662-			     conservative_split_output_size)) {
 8663-				info->ring_buffer_num_entries = conservative_split_output_size;
 8664-			}
 8665-			STBIR_ASSERT(info->ring_buffer_num_entries <=
 8666-			             info->alloc_ring_buffer_num_entries);
 8667-		}
 8668-#undef STBIR__NEXT_PTR
 8669-
 8670-		// is this the first time through loop?
 8671-		if (info == 0) {
 8672-			alloced_total = (15 + (size_t)advance_mem);
 8673-			alloced = STBIR_MALLOC(alloced_total, user_data);
 8674-			if (alloced == 0) {
 8675-				return 0;
 8676-			}
 8677-		} else {
 8678-			return info; // success
 8679-		}
 8680-	}
 8681-}
 8682-
 8683-static int
 8684-stbir__perform_resize(stbir__info const *info, int split_start, int split_count)
 8685-{
 8686-	stbir__per_split_info *split_info = info->split_info + split_start;
 8687-
 8688-	STBIR_PROFILE_CLEAR_EXTRAS();
 8689-
 8690-	STBIR_PROFILE_FIRST_START(looping);
 8691-	if (info->vertical.is_gather) {
 8692-		stbir__vertical_gather_loop(info, split_info, split_count);
 8693-	} else {
 8694-		stbir__vertical_scatter_loop(info, split_info, split_count);
 8695-	}
 8696-	STBIR_PROFILE_END(looping);
 8697-
 8698-	return 1;
 8699-}
 8700-
 8701-static void
 8702-stbir__update_info_from_resize(stbir__info *info, STBIR_RESIZE *resize)
 8703-{
 8704-	static stbir__decode_pixels_func
 8705-	    *decode_simple[STBIR_TYPE_HALF_FLOAT - STBIR_TYPE_UINT8_SRGB + 1] = {
 8706-	        /* 1ch-4ch */ stbir__decode_uint8_srgb,
 8707-	        stbir__decode_uint8_srgb,
 8708-	        0,
 8709-	        stbir__decode_float_linear,
 8710-	        stbir__decode_half_float_linear,
 8711-	    };
 8712-
 8713-	static stbir__decode_pixels_func
 8714-	    *decode_alphas[STBIRI_AR - STBIRI_RGBA +
 8715-	                   1][STBIR_TYPE_HALF_FLOAT - STBIR_TYPE_UINT8_SRGB + 1] = {
 8716-	        {/* RGBA */ stbir__decode_uint8_srgb4_linearalpha,
 8717-	         stbir__decode_uint8_srgb, 0, stbir__decode_float_linear,
 8718-	         stbir__decode_half_float_linear},
 8719-	        {/* BGRA */ stbir__decode_uint8_srgb4_linearalpha_BGRA,
 8720-	         stbir__decode_uint8_srgb_BGRA, 0, stbir__decode_float_linear_BGRA,
 8721-	         stbir__decode_half_float_linear_BGRA},
 8722-	        {/* ARGB */ stbir__decode_uint8_srgb4_linearalpha_ARGB,
 8723-	         stbir__decode_uint8_srgb_ARGB, 0, stbir__decode_float_linear_ARGB,
 8724-	         stbir__decode_half_float_linear_ARGB},
 8725-	        {/* ABGR */ stbir__decode_uint8_srgb4_linearalpha_ABGR,
 8726-	         stbir__decode_uint8_srgb_ABGR, 0, stbir__decode_float_linear_ABGR,
 8727-	         stbir__decode_half_float_linear_ABGR},
 8728-	        {/* RA   */ stbir__decode_uint8_srgb2_linearalpha,
 8729-	         stbir__decode_uint8_srgb, 0, stbir__decode_float_linear,
 8730-	         stbir__decode_half_float_linear},
 8731-	        {/* AR   */ stbir__decode_uint8_srgb2_linearalpha_AR,
 8732-	         stbir__decode_uint8_srgb_AR, 0, stbir__decode_float_linear_AR,
 8733-	         stbir__decode_half_float_linear_AR},
 8734-	    };
 8735-
 8736-	static stbir__decode_pixels_func *decode_simple_scaled_or_not[2][2] = {
 8737-	    {stbir__decode_uint8_linear_scaled, stbir__decode_uint8_linear},
 8738-	    {stbir__decode_uint16_linear_scaled, stbir__decode_uint16_linear},
 8739-	};
 8740-
 8741-	static stbir__decode_pixels_func
 8742-	    *decode_alphas_scaled_or_not[STBIRI_AR - STBIRI_RGBA + 1][2][2] = {
 8743-	        {/* RGBA */ {stbir__decode_uint8_linear_scaled,
 8744-	                     stbir__decode_uint8_linear},
 8745-	         {stbir__decode_uint16_linear_scaled, stbir__decode_uint16_linear}},
 8746-	        {/* BGRA */ {stbir__decode_uint8_linear_scaled_BGRA,
 8747-	                     stbir__decode_uint8_linear_BGRA},
 8748-	         {stbir__decode_uint16_linear_scaled_BGRA,
 8749-	          stbir__decode_uint16_linear_BGRA}},
 8750-	        {/* ARGB */ {stbir__decode_uint8_linear_scaled_ARGB,
 8751-	                     stbir__decode_uint8_linear_ARGB},
 8752-	         {stbir__decode_uint16_linear_scaled_ARGB,
 8753-	          stbir__decode_uint16_linear_ARGB}},
 8754-	        {/* ABGR */ {stbir__decode_uint8_linear_scaled_ABGR,
 8755-	                     stbir__decode_uint8_linear_ABGR},
 8756-	         {stbir__decode_uint16_linear_scaled_ABGR,
 8757-	          stbir__decode_uint16_linear_ABGR}},
 8758-	        {/* RA   */ {stbir__decode_uint8_linear_scaled,
 8759-	                     stbir__decode_uint8_linear},
 8760-	         {stbir__decode_uint16_linear_scaled, stbir__decode_uint16_linear}},
 8761-	        {/* AR   */ {stbir__decode_uint8_linear_scaled_AR,
 8762-	                     stbir__decode_uint8_linear_AR},
 8763-	         {stbir__decode_uint16_linear_scaled_AR,
 8764-	          stbir__decode_uint16_linear_AR}}};
 8765-
 8766-	static stbir__encode_pixels_func
 8767-	    *encode_simple[STBIR_TYPE_HALF_FLOAT - STBIR_TYPE_UINT8_SRGB + 1] = {
 8768-	        /* 1ch-4ch */ stbir__encode_uint8_srgb,
 8769-	        stbir__encode_uint8_srgb,
 8770-	        0,
 8771-	        stbir__encode_float_linear,
 8772-	        stbir__encode_half_float_linear,
 8773-	    };
 8774-
 8775-	static stbir__encode_pixels_func
 8776-	    *encode_alphas[STBIRI_AR - STBIRI_RGBA +
 8777-	                   1][STBIR_TYPE_HALF_FLOAT - STBIR_TYPE_UINT8_SRGB + 1] = {
 8778-	        {/* RGBA */ stbir__encode_uint8_srgb4_linearalpha,
 8779-	         stbir__encode_uint8_srgb, 0, stbir__encode_float_linear,
 8780-	         stbir__encode_half_float_linear},
 8781-	        {/* BGRA */ stbir__encode_uint8_srgb4_linearalpha_BGRA,
 8782-	         stbir__encode_uint8_srgb_BGRA, 0, stbir__encode_float_linear_BGRA,
 8783-	         stbir__encode_half_float_linear_BGRA},
 8784-	        {/* ARGB */ stbir__encode_uint8_srgb4_linearalpha_ARGB,
 8785-	         stbir__encode_uint8_srgb_ARGB, 0, stbir__encode_float_linear_ARGB,
 8786-	         stbir__encode_half_float_linear_ARGB},
 8787-	        {/* ABGR */ stbir__encode_uint8_srgb4_linearalpha_ABGR,
 8788-	         stbir__encode_uint8_srgb_ABGR, 0, stbir__encode_float_linear_ABGR,
 8789-	         stbir__encode_half_float_linear_ABGR},
 8790-	        {/* RA   */ stbir__encode_uint8_srgb2_linearalpha,
 8791-	         stbir__encode_uint8_srgb, 0, stbir__encode_float_linear,
 8792-	         stbir__encode_half_float_linear},
 8793-	        {/* AR   */ stbir__encode_uint8_srgb2_linearalpha_AR,
 8794-	         stbir__encode_uint8_srgb_AR, 0, stbir__encode_float_linear_AR,
 8795-	         stbir__encode_half_float_linear_AR}};
 8796-
 8797-	static stbir__encode_pixels_func *encode_simple_scaled_or_not[2][2] = {
 8798-	    {stbir__encode_uint8_linear_scaled, stbir__encode_uint8_linear},
 8799-	    {stbir__encode_uint16_linear_scaled, stbir__encode_uint16_linear},
 8800-	};
 8801-
 8802-	static stbir__encode_pixels_func
 8803-	    *encode_alphas_scaled_or_not[STBIRI_AR - STBIRI_RGBA + 1][2][2] = {
 8804-	        {/* RGBA */ {stbir__encode_uint8_linear_scaled,
 8805-	                     stbir__encode_uint8_linear},
 8806-	         {stbir__encode_uint16_linear_scaled, stbir__encode_uint16_linear}},
 8807-	        {/* BGRA */ {stbir__encode_uint8_linear_scaled_BGRA,
 8808-	                     stbir__encode_uint8_linear_BGRA},
 8809-	         {stbir__encode_uint16_linear_scaled_BGRA,
 8810-	          stbir__encode_uint16_linear_BGRA}},
 8811-	        {/* ARGB */ {stbir__encode_uint8_linear_scaled_ARGB,
 8812-	                     stbir__encode_uint8_linear_ARGB},
 8813-	         {stbir__encode_uint16_linear_scaled_ARGB,
 8814-	          stbir__encode_uint16_linear_ARGB}},
 8815-	        {/* ABGR */ {stbir__encode_uint8_linear_scaled_ABGR,
 8816-	                     stbir__encode_uint8_linear_ABGR},
 8817-	         {stbir__encode_uint16_linear_scaled_ABGR,
 8818-	          stbir__encode_uint16_linear_ABGR}},
 8819-	        {/* RA   */ {stbir__encode_uint8_linear_scaled,
 8820-	                     stbir__encode_uint8_linear},
 8821-	         {stbir__encode_uint16_linear_scaled, stbir__encode_uint16_linear}},
 8822-	        {/* AR   */ {stbir__encode_uint8_linear_scaled_AR,
 8823-	                     stbir__encode_uint8_linear_AR},
 8824-	         {stbir__encode_uint16_linear_scaled_AR,
 8825-	          stbir__encode_uint16_linear_AR}}};
 8826-
 8827-	stbir__decode_pixels_func *decode_pixels = 0;
 8828-	stbir__encode_pixels_func *encode_pixels = 0;
 8829-	stbir_datatype input_type, output_type;
 8830-
 8831-	input_type = resize->input_data_type;
 8832-	output_type = resize->output_data_type;
 8833-	info->input_data = resize->input_pixels;
 8834-	info->input_stride_bytes = resize->input_stride_in_bytes;
 8835-	info->output_stride_bytes = resize->output_stride_in_bytes;
 8836-
 8837-	// if we're completely point sampling, then we can turn off SRGB
 8838-	if ((info->horizontal.filter_enum == STBIR_FILTER_POINT_SAMPLE) &&
 8839-	    (info->vertical.filter_enum == STBIR_FILTER_POINT_SAMPLE)) {
 8840-		if (((input_type == STBIR_TYPE_UINT8_SRGB) ||
 8841-		     (input_type == STBIR_TYPE_UINT8_SRGB_ALPHA)) &&
 8842-		    ((output_type == STBIR_TYPE_UINT8_SRGB) ||
 8843-		     (output_type == STBIR_TYPE_UINT8_SRGB_ALPHA))) {
 8844-			input_type = STBIR_TYPE_UINT8;
 8845-			output_type = STBIR_TYPE_UINT8;
 8846-		}
 8847-	}
 8848-
 8849-	// recalc the output and input strides
 8850-	if (info->input_stride_bytes == 0) {
 8851-		info->input_stride_bytes = info->channels *
 8852-		                           info->horizontal.scale_info.input_full_size *
 8853-		                           stbir__type_size[input_type];
 8854-	}
 8855-
 8856-	if (info->output_stride_bytes == 0) {
 8857-		info->output_stride_bytes =
 8858-		    info->channels * info->horizontal.scale_info.output_sub_size *
 8859-		    stbir__type_size[output_type];
 8860-	}
 8861-
 8862-	// calc offset
 8863-	info->output_data =
 8864-	    ((char *)resize->output_pixels) +
 8865-	    ((size_t)info->offset_y * (size_t)resize->output_stride_in_bytes) +
 8866-	    (info->offset_x * info->channels * stbir__type_size[output_type]);
 8867-
 8868-	info->in_pixels_cb = resize->input_cb;
 8869-	info->user_data = resize->user_data;
 8870-	info->out_pixels_cb = resize->output_cb;
 8871-
 8872-	// setup the input format converters
 8873-	if ((input_type == STBIR_TYPE_UINT8) || (input_type == STBIR_TYPE_UINT16)) {
 8874-		int non_scaled = 0;
 8875-
 8876-		// check if we can run unscaled - 0-255.0/0-65535.0 instead of 0-1.0
 8877-		// (which is a tiny bit faster when doing linear 8->8 or 16->16)
 8878-		if ((!info->alpha_weight) &&
 8879-		    (!info->alpha_unweight)) { // don't short circuit when alpha
 8880-			                           // weighting (get everything to 0-1.0 as
 8881-			                           // usual)
 8882-			if (((input_type == STBIR_TYPE_UINT8) &&
 8883-			     (output_type == STBIR_TYPE_UINT8)) ||
 8884-			    ((input_type == STBIR_TYPE_UINT16) &&
 8885-			     (output_type == STBIR_TYPE_UINT16))) {
 8886-				non_scaled = 1;
 8887-			}
 8888-		}
 8889-
 8890-		if (info->input_pixel_layout_internal <= STBIRI_4CHANNEL) {
 8891-			decode_pixels =
 8892-			    decode_simple_scaled_or_not[input_type == STBIR_TYPE_UINT16]
 8893-			                               [non_scaled];
 8894-		} else {
 8895-			decode_pixels =
 8896-			    decode_alphas_scaled_or_not[(info->input_pixel_layout_internal -
 8897-			                                 STBIRI_RGBA) %
 8898-			                                (STBIRI_AR - STBIRI_RGBA + 1)]
 8899-			                               [input_type == STBIR_TYPE_UINT16]
 8900-			                               [non_scaled];
 8901-		}
 8902-	} else {
 8903-		if (info->input_pixel_layout_internal <= STBIRI_4CHANNEL) {
 8904-			decode_pixels = decode_simple[input_type - STBIR_TYPE_UINT8_SRGB];
 8905-		} else {
 8906-			decode_pixels = decode_alphas[(info->input_pixel_layout_internal -
 8907-			                               STBIRI_RGBA) %
 8908-			                              (STBIRI_AR - STBIRI_RGBA + 1)]
 8909-			                             [input_type - STBIR_TYPE_UINT8_SRGB];
 8910-		}
 8911-	}
 8912-
 8913-	// setup the output format converters
 8914-	if ((output_type == STBIR_TYPE_UINT8) ||
 8915-	    (output_type == STBIR_TYPE_UINT16)) {
 8916-		int non_scaled = 0;
 8917-
 8918-		// check if we can run unscaled - 0-255.0/0-65535.0 instead of 0-1.0
 8919-		// (which is a tiny bit faster when doing linear 8->8 or 16->16)
 8920-		if ((!info->alpha_weight) &&
 8921-		    (!info->alpha_unweight)) { // don't short circuit when alpha
 8922-			                           // weighting (get everything to 0-1.0 as
 8923-			                           // usual)
 8924-			if (((input_type == STBIR_TYPE_UINT8) &&
 8925-			     (output_type == STBIR_TYPE_UINT8)) ||
 8926-			    ((input_type == STBIR_TYPE_UINT16) &&
 8927-			     (output_type == STBIR_TYPE_UINT16))) {
 8928-				non_scaled = 1;
 8929-			}
 8930-		}
 8931-
 8932-		if (info->output_pixel_layout_internal <= STBIRI_4CHANNEL) {
 8933-			encode_pixels =
 8934-			    encode_simple_scaled_or_not[output_type == STBIR_TYPE_UINT16]
 8935-			                               [non_scaled];
 8936-		} else {
 8937-			encode_pixels = encode_alphas_scaled_or_not
 8938-			    [(info->output_pixel_layout_internal - STBIRI_RGBA) %
 8939-			     (STBIRI_AR - STBIRI_RGBA + 1)]
 8940-			    [output_type == STBIR_TYPE_UINT16][non_scaled];
 8941-		}
 8942-	} else {
 8943-		if (info->output_pixel_layout_internal <= STBIRI_4CHANNEL) {
 8944-			encode_pixels = encode_simple[output_type - STBIR_TYPE_UINT8_SRGB];
 8945-		} else {
 8946-			encode_pixels = encode_alphas[(info->output_pixel_layout_internal -
 8947-			                               STBIRI_RGBA) %
 8948-			                              (STBIRI_AR - STBIRI_RGBA + 1)]
 8949-			                             [output_type - STBIR_TYPE_UINT8_SRGB];
 8950-		}
 8951-	}
 8952-
 8953-	info->input_type = input_type;
 8954-	info->output_type = output_type;
 8955-	info->decode_pixels = decode_pixels;
 8956-	info->encode_pixels = encode_pixels;
 8957-}
 8958-
 8959-static void
 8960-stbir__clip(int *outx, int *outsubw, int outw, double *u0, double *u1)
 8961-{
 8962-	double per, adj;
 8963-	int over;
 8964-
 8965-	// do left/top edge
 8966-	if (*outx < 0) {
 8967-		per = ((double)*outx) / ((double)*outsubw); // is negative
 8968-		adj = per * (*u1 - *u0);
 8969-		*u0 -= adj; // increases u0
 8970-		*outx = 0;
 8971-	}
 8972-
 8973-	// do right/bot edge
 8974-	over = outw - (*outx + *outsubw);
 8975-	if (over < 0) {
 8976-		per = ((double)over) / ((double)*outsubw); // is negative
 8977-		adj = per * (*u1 - *u0);
 8978-		*u1 += adj; // decreases u1
 8979-		*outsubw = outw - *outx;
 8980-	}
 8981-}
 8982-
 8983-// converts a double to a rational that has less than one float bit of error
 8984-// (returns 0 if unable to do so)
 8985-static int
 8986-stbir__double_to_rational(double f, stbir_uint32 limit, stbir_uint32 *numer,
 8987-                          stbir_uint32 *denom,
 8988-                          int limit_denom) // limit_denom (1) or limit numer (0)
 8989-{
 8990-	double err;
 8991-	stbir_uint64 top, bot;
 8992-	stbir_uint64 numer_last = 0;
 8993-	stbir_uint64 denom_last = 1;
 8994-	stbir_uint64 numer_estimate = 1;
 8995-	stbir_uint64 denom_estimate = 0;
 8996-
 8997-	// scale to past float error range
 8998-	top = (stbir_uint64)(f * (double)(1 << 25));
 8999-	bot = 1 << 25;
 9000-
 9001-	// keep refining, but usually stops in a few loops - usually 5 for bad cases
 9002-	for (;;) {
 9003-		stbir_uint64 est, temp;
 9004-
 9005-		// hit limit, break out and do best full range estimate
 9006-		if (((limit_denom) ? denom_estimate : numer_estimate) >= limit) {
 9007-			break;
 9008-		}
 9009-
 9010-		// is the current error less than 1 bit of a float? if so, we're done
 9011-		if (denom_estimate) {
 9012-			err = ((double)numer_estimate / (double)denom_estimate) - f;
 9013-			if (err < 0.0) {
 9014-				err = -err;
 9015-			}
 9016-			if (err < (1.0 / (double)(1 << 24))) {
 9017-				// yup, found it
 9018-				*numer = (stbir_uint32)numer_estimate;
 9019-				*denom = (stbir_uint32)denom_estimate;
 9020-				return 1;
 9021-			}
 9022-		}
 9023-
 9024-		// no more refinement bits left? break out and do full range estimate
 9025-		if (bot == 0) {
 9026-			break;
 9027-		}
 9028-
 9029-		// gcd the estimate bits
 9030-		est = top / bot;
 9031-		temp = top % bot;
 9032-		top = bot;
 9033-		bot = temp;
 9034-
 9035-		// move remainders
 9036-		temp = est * denom_estimate + denom_last;
 9037-		denom_last = denom_estimate;
 9038-		denom_estimate = temp;
 9039-
 9040-		// move remainders
 9041-		temp = est * numer_estimate + numer_last;
 9042-		numer_last = numer_estimate;
 9043-		numer_estimate = temp;
 9044-	}
 9045-
 9046-	// we didn't find anything good enough for float, use a full range estimate
 9047-	if (limit_denom) {
 9048-		numer_estimate = (stbir_uint64)(f * (double)limit + 0.5);
 9049-		denom_estimate = limit;
 9050-	} else {
 9051-		numer_estimate = limit;
 9052-		denom_estimate = (stbir_uint64)(((double)limit / f) + 0.5);
 9053-	}
 9054-
 9055-	*numer = (stbir_uint32)numer_estimate;
 9056-	*denom = (stbir_uint32)denom_estimate;
 9057-
 9058-	err = (denom_estimate) ? (((double)(stbir_uint32)numer_estimate /
 9059-	                           (double)(stbir_uint32)denom_estimate) -
 9060-	                          f)
 9061-	                       : 1.0;
 9062-	if (err < 0.0) {
 9063-		err = -err;
 9064-	}
 9065-	return (err < (1.0 / (double)(1 << 24))) ? 1 : 0;
 9066-}
 9067-
 9068-static int
 9069-stbir__calculate_region_transform(stbir__scale_info *scale_info,
 9070-                                  int output_full_range, int *output_offset,
 9071-                                  int output_sub_range, int input_full_range,
 9072-                                  double input_s0, double input_s1)
 9073-{
 9074-	double output_range, input_range, output_s, input_s, ratio, scale;
 9075-
 9076-	input_s = input_s1 - input_s0;
 9077-
 9078-	// null area
 9079-	if ((output_full_range == 0) || (input_full_range == 0) ||
 9080-	    (output_sub_range == 0) || (input_s <= stbir__small_float)) {
 9081-		return 0;
 9082-	}
 9083-
 9084-	// are either of the ranges completely out of bounds?
 9085-	if ((*output_offset >= output_full_range) ||
 9086-	    ((*output_offset + output_sub_range) <= 0) ||
 9087-	    (input_s0 >= (1.0f - stbir__small_float)) ||
 9088-	    (input_s1 <= stbir__small_float)) {
 9089-		return 0;
 9090-	}
 9091-
 9092-	output_range = (double)output_full_range;
 9093-	input_range = (double)input_full_range;
 9094-
 9095-	output_s = ((double)output_sub_range) / output_range;
 9096-
 9097-	// figure out the scaling to use
 9098-	ratio = output_s / input_s;
 9099-
 9100-	// save scale before clipping
 9101-	scale = (output_range / input_range) * ratio;
 9102-	scale_info->scale = (float)scale;
 9103-	scale_info->inv_scale = (float)(1.0 / scale);
 9104-
 9105-	// clip output area to left/right output edges (and adjust input area)
 9106-	stbir__clip(output_offset, &output_sub_range, output_full_range, &input_s0,
 9107-	            &input_s1);
 9108-
 9109-	// recalc input area
 9110-	input_s = input_s1 - input_s0;
 9111-
 9112-	// after clipping do we have zero input area?
 9113-	if (input_s <= stbir__small_float) {
 9114-		return 0;
 9115-	}
 9116-
 9117-	// calculate and store the starting source offsets in output pixel space
 9118-	scale_info->pixel_shift = (float)(input_s0 * ratio * output_range);
 9119-
 9120-	scale_info->scale_is_rational = stbir__double_to_rational(
 9121-	    scale, (scale <= 1.0) ? output_full_range : input_full_range,
 9122-	    &scale_info->scale_numerator, &scale_info->scale_denominator,
 9123-	    (scale >= 1.0));
 9124-
 9125-	scale_info->input_full_size = input_full_range;
 9126-	scale_info->output_sub_size = output_sub_range;
 9127-
 9128-	return 1;
 9129-}
 9130-
 9131-static void
 9132-stbir__init_and_set_layout(STBIR_RESIZE *resize,
 9133-                           stbir_pixel_layout pixel_layout,
 9134-                           stbir_datatype data_type)
 9135-{
 9136-	resize->input_cb = 0;
 9137-	resize->output_cb = 0;
 9138-	resize->user_data = resize;
 9139-	resize->samplers = 0;
 9140-	resize->called_alloc = 0;
 9141-	resize->horizontal_filter = STBIR_FILTER_DEFAULT;
 9142-	resize->horizontal_filter_kernel = 0;
 9143-	resize->horizontal_filter_support = 0;
 9144-	resize->vertical_filter = STBIR_FILTER_DEFAULT;
 9145-	resize->vertical_filter_kernel = 0;
 9146-	resize->vertical_filter_support = 0;
 9147-	resize->horizontal_edge = STBIR_EDGE_CLAMP;
 9148-	resize->vertical_edge = STBIR_EDGE_CLAMP;
 9149-	resize->input_s0 = 0;
 9150-	resize->input_t0 = 0;
 9151-	resize->input_s1 = 1;
 9152-	resize->input_t1 = 1;
 9153-	resize->output_subx = 0;
 9154-	resize->output_suby = 0;
 9155-	resize->output_subw = resize->output_w;
 9156-	resize->output_subh = resize->output_h;
 9157-	resize->input_data_type = data_type;
 9158-	resize->output_data_type = data_type;
 9159-	resize->input_pixel_layout_public = pixel_layout;
 9160-	resize->output_pixel_layout_public = pixel_layout;
 9161-	resize->needs_rebuild = 1;
 9162-}
 9163-
 9164-STBIRDEF void
 9165-stbir_resize_init(STBIR_RESIZE *resize, const void *input_pixels, int input_w,
 9166-                  int input_h, int input_stride_in_bytes, // stride can be zero
 9167-                  void *output_pixels, int output_w, int output_h,
 9168-                  int output_stride_in_bytes, // stride can be zero
 9169-                  stbir_pixel_layout pixel_layout, stbir_datatype data_type)
 9170-{
 9171-	resize->input_pixels = input_pixels;
 9172-	resize->input_w = input_w;
 9173-	resize->input_h = input_h;
 9174-	resize->input_stride_in_bytes = input_stride_in_bytes;
 9175-	resize->output_pixels = output_pixels;
 9176-	resize->output_w = output_w;
 9177-	resize->output_h = output_h;
 9178-	resize->output_stride_in_bytes = output_stride_in_bytes;
 9179-	resize->fast_alpha = 0;
 9180-
 9181-	stbir__init_and_set_layout(resize, pixel_layout, data_type);
 9182-}
 9183-
 9184-// You can update parameters any time after resize_init
 9185-STBIRDEF void
 9186-stbir_set_datatypes(
 9187-    STBIR_RESIZE *resize, stbir_datatype input_type,
 9188-    stbir_datatype output_type) // by default, datatype from resize_init
 9189-{
 9190-	resize->input_data_type = input_type;
 9191-	resize->output_data_type = output_type;
 9192-	if ((resize->samplers) && (!resize->needs_rebuild)) {
 9193-		stbir__update_info_from_resize(resize->samplers, resize);
 9194-	}
 9195-}
 9196-
 9197-STBIRDEF void
 9198-stbir_set_pixel_callbacks(
 9199-    STBIR_RESIZE *resize, stbir_input_callback *input_cb,
 9200-    stbir_output_callback *output_cb) // no callbacks by default
 9201-{
 9202-	resize->input_cb = input_cb;
 9203-	resize->output_cb = output_cb;
 9204-
 9205-	if ((resize->samplers) && (!resize->needs_rebuild)) {
 9206-		resize->samplers->in_pixels_cb = input_cb;
 9207-		resize->samplers->out_pixels_cb = output_cb;
 9208-	}
 9209-}
 9210-
 9211-STBIRDEF void
 9212-stbir_set_user_data(STBIR_RESIZE *resize,
 9213-                    void *user_data) // pass back STBIR_RESIZE* by default
 9214-{
 9215-	resize->user_data = user_data;
 9216-	if ((resize->samplers) && (!resize->needs_rebuild)) {
 9217-		resize->samplers->user_data = user_data;
 9218-	}
 9219-}
 9220-
 9221-STBIRDEF void
 9222-stbir_set_buffer_ptrs(STBIR_RESIZE *resize, const void *input_pixels,
 9223-                      int input_stride_in_bytes, void *output_pixels,
 9224-                      int output_stride_in_bytes)
 9225-{
 9226-	resize->input_pixels = input_pixels;
 9227-	resize->input_stride_in_bytes = input_stride_in_bytes;
 9228-	resize->output_pixels = output_pixels;
 9229-	resize->output_stride_in_bytes = output_stride_in_bytes;
 9230-	if ((resize->samplers) && (!resize->needs_rebuild)) {
 9231-		stbir__update_info_from_resize(resize->samplers, resize);
 9232-	}
 9233-}
 9234-
 9235-STBIRDEF int
 9236-stbir_set_edgemodes(STBIR_RESIZE *resize, stbir_edge horizontal_edge,
 9237-                    stbir_edge vertical_edge) // CLAMP by default
 9238-{
 9239-	resize->horizontal_edge = horizontal_edge;
 9240-	resize->vertical_edge = vertical_edge;
 9241-	resize->needs_rebuild = 1;
 9242-	return 1;
 9243-}
 9244-
 9245-STBIRDEF int
 9246-stbir_set_filters(STBIR_RESIZE *resize, stbir_filter horizontal_filter,
 9247-                  stbir_filter vertical_filter) // STBIR_DEFAULT_FILTER_UPSAMPLE/DOWNSAMPLE
 9248-                                                // by default
 9249-{
 9250-	resize->horizontal_filter = horizontal_filter;
 9251-	resize->vertical_filter = vertical_filter;
 9252-	resize->needs_rebuild = 1;
 9253-	return 1;
 9254-}
 9255-
 9256-STBIRDEF int
 9257-stbir_set_filter_callbacks(STBIR_RESIZE *resize,
 9258-                           stbir__kernel_callback *horizontal_filter,
 9259-                           stbir__support_callback *horizontal_support,
 9260-                           stbir__kernel_callback *vertical_filter,
 9261-                           stbir__support_callback *vertical_support)
 9262-{
 9263-	resize->horizontal_filter_kernel = horizontal_filter;
 9264-	resize->horizontal_filter_support = horizontal_support;
 9265-	resize->vertical_filter_kernel = vertical_filter;
 9266-	resize->vertical_filter_support = vertical_support;
 9267-	resize->needs_rebuild = 1;
 9268-	return 1;
 9269-}
 9270-
 9271-STBIRDEF int
 9272-stbir_set_pixel_layouts(
 9273-    STBIR_RESIZE *resize, stbir_pixel_layout input_pixel_layout,
 9274-    stbir_pixel_layout output_pixel_layout) // sets new pixel layouts
 9275-{
 9276-	resize->input_pixel_layout_public = input_pixel_layout;
 9277-	resize->output_pixel_layout_public = output_pixel_layout;
 9278-	resize->needs_rebuild = 1;
 9279-	return 1;
 9280-}
 9281-
 9282-STBIRDEF int
 9283-stbir_set_non_pm_alpha_speed_over_quality(
 9284-    STBIR_RESIZE *resize,
 9285-    int non_pma_alpha_speed_over_quality) // sets alpha speed
 9286-{
 9287-	resize->fast_alpha = non_pma_alpha_speed_over_quality;
 9288-	resize->needs_rebuild = 1;
 9289-	return 1;
 9290-}
 9291-
 9292-STBIRDEF int
 9293-stbir_set_input_subrect(STBIR_RESIZE *resize, double s0, double t0, double s1,
 9294-                        double t1) // sets input region (full region by default)
 9295-{
 9296-	resize->input_s0 = s0;
 9297-	resize->input_t0 = t0;
 9298-	resize->input_s1 = s1;
 9299-	resize->input_t1 = t1;
 9300-	resize->needs_rebuild = 1;
 9301-
 9302-	// are we inbounds?
 9303-	if ((s1 < stbir__small_float) || ((s1 - s0) < stbir__small_float) ||
 9304-	    (t1 < stbir__small_float) || ((t1 - t0) < stbir__small_float) ||
 9305-	    (s0 > (1.0f - stbir__small_float)) ||
 9306-	    (t0 > (1.0f - stbir__small_float))) {
 9307-		return 0;
 9308-	}
 9309-
 9310-	return 1;
 9311-}
 9312-
 9313-STBIRDEF int
 9314-stbir_set_output_pixel_subrect(
 9315-    STBIR_RESIZE *resize, int subx, int suby, int subw,
 9316-	    int subh) // sets output region (full region by default)
 9317-{
 9318-	resize->output_subx = subx;
 9319-	resize->output_suby = suby;
 9320-	resize->output_subw = subw;
 9321-	resize->output_subh = subh;
 9322-	resize->needs_rebuild = 1;
 9323-
 9324-	// are we inbounds?
 9325-	if ((subx >= resize->output_w) || ((subx + subw) <= 0) ||
 9326-	    (suby >= resize->output_h) || ((suby + subh) <= 0) || (subw == 0) ||
 9327-	    (subh == 0)) {
 9328-		return 0;
 9329-	}
 9330-
 9331-	return 1;
 9332-}
 9333-
 9334-STBIRDEF int
 9335-stbir_set_pixel_subrect(STBIR_RESIZE *resize, int subx, int suby, int subw,
 9336-                        int subh) // sets both regions (full regions by default)
 9337-{
 9338-	double s0, t0, s1, t1;
 9339-
 9340-	s0 = ((double)subx) / ((double)resize->output_w);
 9341-	t0 = ((double)suby) / ((double)resize->output_h);
 9342-	s1 = ((double)(subx + subw)) / ((double)resize->output_w);
 9343-	t1 = ((double)(suby + subh)) / ((double)resize->output_h);
 9344-
 9345-	resize->input_s0 = s0;
 9346-	resize->input_t0 = t0;
 9347-	resize->input_s1 = s1;
 9348-	resize->input_t1 = t1;
 9349-	resize->output_subx = subx;
 9350-	resize->output_suby = suby;
 9351-	resize->output_subw = subw;
 9352-	resize->output_subh = subh;
 9353-	resize->needs_rebuild = 1;
 9354-
 9355-	// are we inbounds?
 9356-	if ((subx >= resize->output_w) || ((subx + subw) <= 0) ||
 9357-	    (suby >= resize->output_h) || ((suby + subh) <= 0) || (subw == 0) ||
 9358-	    (subh == 0)) {
 9359-		return 0;
 9360-	}
 9361-
 9362-	return 1;
 9363-}
 9364-
 9365-static int
 9366-stbir__perform_build(STBIR_RESIZE *resize, int splits)
 9367-{
 9368-	stbir__contributors conservative = {0, 0};
 9369-	stbir__sampler horizontal, vertical;
 9370-	int new_output_subx, new_output_suby;
 9371-	stbir__info *out_info;
 9372-#ifdef STBIR_PROFILE
 9373-	stbir__info profile_infod; // used to contain building profile info before
 9374-	                           // everything is allocated
 9375-	stbir__info *profile_info = &profile_infod;
 9376-#endif
 9377-
 9378-	// have we already built the samplers?
 9379-	if (resize->samplers) {
 9380-		return 0;
 9381-	}
 9382-
 9383-#define STBIR_RETURN_ERROR_AND_ASSERT(exp)                                     \
 9384-	STBIR_ASSERT(!(exp));                                                      \
 9385-	if (exp)                                                                   \
 9386-		return 0;
 9387-	STBIR_RETURN_ERROR_AND_ASSERT((unsigned)resize->horizontal_filter >=
 9388-	                              STBIR_FILTER_OTHER)
 9389-	STBIR_RETURN_ERROR_AND_ASSERT((unsigned)resize->vertical_filter >=
 9390-	                              STBIR_FILTER_OTHER)
 9391-#undef STBIR_RETURN_ERROR_AND_ASSERT
 9392-
 9393-	if (splits <= 0) {
 9394-		return 0;
 9395-	}
 9396-
 9397-	STBIR_PROFILE_BUILD_FIRST_START(build);
 9398-
 9399-	new_output_subx = resize->output_subx;
 9400-	new_output_suby = resize->output_suby;
 9401-
 9402-	// do horizontal clip and scale calcs
 9403-	if (!stbir__calculate_region_transform(
 9404-	        &horizontal.scale_info, resize->output_w, &new_output_subx,
 9405-	        resize->output_subw, resize->input_w, resize->input_s0,
 9406-	        resize->input_s1)) {
 9407-		return 0;
 9408-	}
 9409-
 9410-	// do vertical clip and scale calcs
 9411-	if (!stbir__calculate_region_transform(
 9412-	        &vertical.scale_info, resize->output_h, &new_output_suby,
 9413-	        resize->output_subh, resize->input_h, resize->input_t0,
 9414-	        resize->input_t1)) {
 9415-		return 0;
 9416-	}
 9417-
 9418-	// if nothing to do, just return
 9419-	if ((horizontal.scale_info.output_sub_size == 0) ||
 9420-	    (vertical.scale_info.output_sub_size == 0)) {
 9421-		return 0;
 9422-	}
 9423-
 9424-	stbir__set_sampler(
 9425-	    &horizontal, resize->horizontal_filter,
 9426-	    resize->horizontal_filter_kernel, resize->horizontal_filter_support,
 9427-	    resize->horizontal_edge, &horizontal.scale_info, 1, resize->user_data);
 9428-	stbir__get_conservative_extents(&horizontal, &conservative,
 9429-	                                resize->user_data);
 9430-	stbir__set_sampler(&vertical, resize->vertical_filter,
 9431-	                   resize->vertical_filter_kernel,
 9432-	                   resize->vertical_filter_support, resize->vertical_edge,
 9433-	                   &vertical.scale_info, 0, resize->user_data);
 9434-
 9435-	if ((vertical.scale_info.output_sub_size / splits) <
 9436-	    STBIR_FORCE_MINIMUM_SCANLINES_FOR_SPLITS) // each split should be a
 9437-	                                              // minimum of 4 scanlines
 9438-	                                              // (handwavey choice)
 9439-	{
 9440-		splits = vertical.scale_info.output_sub_size /
 9441-		         STBIR_FORCE_MINIMUM_SCANLINES_FOR_SPLITS;
 9442-		if (splits == 0) {
 9443-			splits = 1;
 9444-		}
 9445-	}
 9446-
 9447-	STBIR_PROFILE_BUILD_START(alloc);
 9448-	out_info = stbir__alloc_internal_mem_and_build_samplers(
 9449-	    &horizontal, &vertical, &conservative,
 9450-	    resize->input_pixel_layout_public, resize->output_pixel_layout_public,
 9451-	    splits, new_output_subx, new_output_suby, resize->fast_alpha,
 9452-	    resize->user_data STBIR_ONLY_PROFILE_BUILD_SET_INFO);
 9453-	STBIR_PROFILE_BUILD_END(alloc);
 9454-	STBIR_PROFILE_BUILD_END(build);
 9455-
 9456-	if (out_info) {
 9457-		resize->splits = splits;
 9458-		resize->samplers = out_info;
 9459-		resize->needs_rebuild = 0;
 9460-#ifdef STBIR_PROFILE
 9461-		STBIR_MEMCPY(&out_info->profile, &profile_infod.profile,
 9462-		             sizeof(out_info->profile));
 9463-#endif
 9464-
 9465-		// update anything that can be changed without recalculating samplers
 9466-		stbir__update_info_from_resize(out_info, resize);
 9467-
 9468-		return splits;
 9469-	}
 9470-
 9471-	return 0;
 9472-}
 9473-
 9474-void
 9475-stbir_free_samplers(STBIR_RESIZE *resize)
 9476-{
 9477-	if (resize->samplers) {
 9478-		stbir__free_internal_mem(resize->samplers);
 9479-		resize->samplers = 0;
 9480-		resize->called_alloc = 0;
 9481-	}
 9482-}
 9483-
 9484-STBIRDEF int
 9485-stbir_build_samplers_with_splits(STBIR_RESIZE *resize, int splits)
 9486-{
 9487-	if ((resize->samplers == 0) || (resize->needs_rebuild)) {
 9488-		if (resize->samplers) {
 9489-			stbir_free_samplers(resize);
 9490-		}
 9491-
 9492-		resize->called_alloc = 1;
 9493-		return stbir__perform_build(resize, splits);
 9494-	}
 9495-
 9496-	STBIR_PROFILE_BUILD_CLEAR(resize->samplers);
 9497-
 9498-	return 1;
 9499-}
 9500-
 9501-STBIRDEF int
 9502-stbir_build_samplers(STBIR_RESIZE *resize)
 9503-{
 9504-	return stbir_build_samplers_with_splits(resize, 1);
 9505-}
 9506-
 9507-STBIRDEF int
 9508-stbir_resize_extended(STBIR_RESIZE *resize)
 9509-{
 9510-	int result;
 9511-
 9512-	if ((resize->samplers == 0) || (resize->needs_rebuild)) {
 9513-		int alloc_state = resize->called_alloc; // remember allocated state
 9514-
 9515-		if (resize->samplers) {
 9516-			stbir__free_internal_mem(resize->samplers);
 9517-			resize->samplers = 0;
 9518-		}
 9519-
 9520-		if (!stbir_build_samplers(resize)) {
 9521-			return 0;
 9522-		}
 9523-
 9524-		resize->called_alloc = alloc_state;
 9525-
 9526-		// if build_samplers succeeded (above), but there are no samplers
 9527-		// set, then the area to stretch into was zero pixels,
 9528-		// so there is nothing to resize:
 9529-		// don't do anything and return success
 9530-		if (resize->samplers == 0) {
 9531-			return 1;
 9532-		}
 9533-	} else {
 9534-		// didn't build anything - clear it
 9535-		STBIR_PROFILE_BUILD_CLEAR(resize->samplers);
 9536-	}
 9537-
 9538-	// do resize
 9539-	result = stbir__perform_resize(resize->samplers, 0, resize->splits);
 9540-
 9541-	// if the samplers were built implicitly here (not by the caller), free them
 9542-	if (!resize->called_alloc) {
 9543-		stbir_free_samplers(resize);
 9544-		resize->samplers = 0;
 9545-	}
 9546-
 9547-	return result;
 9548-}
 9549-
 9550-STBIRDEF int
 9551-stbir_resize_extended_split(STBIR_RESIZE *resize, int split_start,
 9552-                            int split_count)
 9553-{
 9554-	STBIR_ASSERT(resize->samplers);
 9555-
 9556-	// if we're just doing the whole thing, call full
 9557-	if ((split_start == -1) ||
 9558-	    ((split_start == 0) && (split_count == resize->splits))) {
 9559-		return stbir_resize_extended(resize);
 9560-	}
 9561-
 9562-	// you **must** build samplers first when using split resize
 9563-	if ((resize->samplers == 0) || (resize->needs_rebuild)) {
 9564-		return 0;
 9565-	}
 9566-
 9567-	if ((split_start >= resize->splits) || (split_start < 0) ||
 9568-	    ((split_start + split_count) > resize->splits) || (split_count <= 0)) {
 9569-		return 0;
 9570-	}
 9571-
 9572-	// do resize
 9573-	return stbir__perform_resize(resize->samplers, split_start, split_count);
 9574-}
 9575-
 9576-static void *
 9577-stbir_quick_resize_helper(const void *input_pixels, int input_w, int input_h,
 9578-                          int input_stride_in_bytes, void *output_pixels,
 9579-                          int output_w, int output_h,
 9580-                          int output_stride_in_bytes,
 9581-                          stbir_pixel_layout pixel_layout,
 9582-                          stbir_datatype data_type, stbir_edge edge,
 9583-                          stbir_filter filter)
 9584-{
 9585-	STBIR_RESIZE resize;
 9586-	int scanline_output_in_bytes;
 9587-	int positive_output_stride_in_bytes;
 9588-	void *start_ptr;
 9589-	void *free_ptr;
 9590-
 9591-	scanline_output_in_bytes =
 9592-	    output_w * stbir__type_size[data_type] *
 9593-	    stbir__pixel_channels
 9594-	        [stbir__pixel_layout_convert_public_to_internal[pixel_layout]];
 9595-	if (scanline_output_in_bytes == 0) {
 9596-		return 0;
 9597-	}
 9598-
 9599-	// if zero stride, use scanline output
 9600-	if (output_stride_in_bytes == 0) {
 9601-		output_stride_in_bytes = scanline_output_in_bytes;
 9602-	}
 9603-
 9604-	// abs value for inverted images (negative pitches)
 9605-	positive_output_stride_in_bytes = output_stride_in_bytes;
 9606-	if (positive_output_stride_in_bytes < 0) {
 9607-		positive_output_stride_in_bytes = -positive_output_stride_in_bytes;
 9608-	}
 9609-
 9610-	// is the requested stride smaller than the scanline output? if so, just
 9611-	// fail
 9612-	if (positive_output_stride_in_bytes < scanline_output_in_bytes) {
 9613-		return 0;
 9614-	}
 9615-
 9616-	start_ptr = output_pixels;
 9617-	free_ptr = 0; // no free pointer, since they passed buffer to use
 9618-
 9619-	// did they pass a zero for the dest? if so, allocate the buffer
 9620-	if (output_pixels == 0) {
 9621-		size_t size;
 9622-		char *ptr;
 9623-
 9624-		size = (size_t)positive_output_stride_in_bytes * (size_t)output_h;
 9625-		if (size == 0) {
 9626-			return 0;
 9627-		}
 9628-
 9629-		ptr = (char *)STBIR_MALLOC(size, 0);
 9630-		if (ptr == 0) {
 9631-			return 0;
 9632-		}
 9633-
 9634-		free_ptr = ptr;
 9635-
 9636-		// point at the last scanline, if they requested a flipped image
 9637-		if (output_stride_in_bytes < 0) {
 9638-			start_ptr = ptr + ((size_t)positive_output_stride_in_bytes *
 9639-			                   (size_t)(output_h - 1));
 9640-		} else {
 9641-			start_ptr = ptr;
 9642-		}
 9643-	}
 9644-
 9645-	// ok, now do the resize
 9646-	stbir_resize_init(&resize, input_pixels, input_w, input_h,
 9647-	                  input_stride_in_bytes, start_ptr, output_w, output_h,
 9648-	                  output_stride_in_bytes, pixel_layout, data_type);
 9649-
 9650-	resize.horizontal_edge = edge;
 9651-	resize.vertical_edge = edge;
 9652-	resize.horizontal_filter = filter;
 9653-	resize.vertical_filter = filter;
 9654-
 9655-	if (!stbir_resize_extended(&resize)) {
 9656-		if (free_ptr) {
 9657-			STBIR_FREE(free_ptr, 0);
 9658-		}
 9659-		return 0;
 9660-	}
 9661-
 9662-	return (free_ptr) ? free_ptr : start_ptr;
 9663-}
 9664-
 9665-STBIRDEF unsigned char *
 9666-stbir_resize_uint8_linear(const unsigned char *input_pixels, int input_w,
 9667-                          int input_h, int input_stride_in_bytes,
 9668-                          unsigned char *output_pixels, int output_w,
 9669-                          int output_h, int output_stride_in_bytes,
 9670-                          stbir_pixel_layout pixel_layout)
 9671-{
 9672-	return (unsigned char *)stbir_quick_resize_helper(
 9673-	    input_pixels, input_w, input_h, input_stride_in_bytes, output_pixels,
 9674-	    output_w, output_h, output_stride_in_bytes, pixel_layout,
 9675-	    STBIR_TYPE_UINT8, STBIR_EDGE_CLAMP, STBIR_FILTER_DEFAULT);
 9676-}
 9677-
 9678-STBIRDEF unsigned char *
 9679-stbir_resize_uint8_srgb(const unsigned char *input_pixels, int input_w,
 9680-                        int input_h, int input_stride_in_bytes,
 9681-                        unsigned char *output_pixels, int output_w,
 9682-                        int output_h, int output_stride_in_bytes,
 9683-                        stbir_pixel_layout pixel_layout)
 9684-{
 9685-	return (unsigned char *)stbir_quick_resize_helper(
 9686-	    input_pixels, input_w, input_h, input_stride_in_bytes, output_pixels,
 9687-	    output_w, output_h, output_stride_in_bytes, pixel_layout,
 9688-	    STBIR_TYPE_UINT8_SRGB, STBIR_EDGE_CLAMP, STBIR_FILTER_DEFAULT);
 9689-}
 9690-
 9691-STBIRDEF float *
 9692-stbir_resize_float_linear(const float *input_pixels, int input_w, int input_h,
 9693-                          int input_stride_in_bytes, float *output_pixels,
 9694-                          int output_w, int output_h,
 9695-                          int output_stride_in_bytes,
 9696-                          stbir_pixel_layout pixel_layout)
 9697-{
 9698-	return (float *)stbir_quick_resize_helper(
 9699-	    input_pixels, input_w, input_h, input_stride_in_bytes, output_pixels,
 9700-	    output_w, output_h, output_stride_in_bytes, pixel_layout,
 9701-	    STBIR_TYPE_FLOAT, STBIR_EDGE_CLAMP, STBIR_FILTER_DEFAULT);
 9702-}
 9703-
 9704-STBIRDEF void *
 9705-stbir_resize(const void *input_pixels, int input_w, int input_h,
 9706-             int input_stride_in_bytes, void *output_pixels, int output_w,
 9707-             int output_h, int output_stride_in_bytes,
 9708-             stbir_pixel_layout pixel_layout, stbir_datatype data_type,
 9709-             stbir_edge edge, stbir_filter filter)
 9710-{
 9711-	return (void *)stbir_quick_resize_helper(
 9712-	    input_pixels, input_w, input_h, input_stride_in_bytes, output_pixels,
 9713-	    output_w, output_h, output_stride_in_bytes, pixel_layout, data_type,
 9714-	    edge, filter);
 9715-}
 9716-
 9717-#ifdef STBIR_PROFILE
 9718-
 9719-STBIRDEF void
 9720-stbir_resize_build_profile_info(STBIR_PROFILE_INFO *info,
 9721-                                STBIR_RESIZE const *resize)
 9722-{
 9723-	static char const *bdescriptions[6] = {
 9724-	    "Building",         "Allocating",          "Horizontal sampler",
 9725-	    "Vertical sampler", "Coefficient cleanup", "Coefficient pivot"};
 9726-	stbir__info *samp = resize->samplers;
 9727-	int i;
 9728-
 9729-	typedef int testa[(STBIR__ARRAY_SIZE(bdescriptions) ==
 9730-	                   (STBIR__ARRAY_SIZE(samp->profile.array) - 1))
 9731-	                      ? 1
 9732-	                      : -1];
 9733-	typedef int
 9734-	    testb[(sizeof(samp->profile.array) == (sizeof(samp->profile.named)))
 9735-	              ? 1
 9736-	              : -1];
 9737-	typedef int
 9738-	    testc[(sizeof(info->clocks) >= (sizeof(samp->profile.named))) ? 1 : -1];
 9739-
 9740-	for (i = 0; i < STBIR__ARRAY_SIZE(bdescriptions); i++) {
 9741-		info->clocks[i] = samp->profile.array[i + 1];
 9742-	}
 9743-
 9744-	info->total_clocks = samp->profile.named.total;
 9745-	info->descriptions = bdescriptions;
 9746-	info->count = STBIR__ARRAY_SIZE(bdescriptions);
 9747-}
 9748-
 9749-STBIRDEF void
 9750-stbir_resize_split_profile_info(STBIR_PROFILE_INFO *info,
 9751-                                STBIR_RESIZE const *resize, int split_start,
 9752-                                int split_count)
 9753-{
 9754-	static char const *descriptions[7] = {
 9755-	    "Looping",          "Vertical sampling", "Horizontal sampling",
 9756-	    "Scanline input",   "Scanline output",   "Alpha weighting",
 9757-	    "Alpha unweighting"};
 9758-	stbir__per_split_info *split_info;
 9759-	int s, i;
 9760-
 9761-	typedef int testa[(STBIR__ARRAY_SIZE(descriptions) ==
 9762-	                   (STBIR__ARRAY_SIZE(split_info->profile.array) - 1))
 9763-	                      ? 1
 9764-	                      : -1];
 9765-	typedef int testb[(sizeof(split_info->profile.array) ==
 9766-	                   (sizeof(split_info->profile.named)))
 9767-	                      ? 1
 9768-	                      : -1];
 9769-	typedef int
 9770-	    testc[(sizeof(info->clocks) >= (sizeof(split_info->profile.named)))
 9771-	              ? 1
 9772-	              : -1];
 9773-
 9774-	if (split_start == -1) {
 9775-		split_start = 0;
 9776-		split_count = resize->samplers->splits;
 9777-	}
 9778-
 9779-	if ((split_start >= resize->splits) || (split_start < 0) ||
 9780-	    ((split_start + split_count) > resize->splits) || (split_count <= 0)) {
 9781-		info->total_clocks = 0;
 9782-		info->descriptions = 0;
 9783-		info->count = 0;
 9784-		return;
 9785-	}
 9786-
 9787-	split_info = resize->samplers->split_info + split_start;
 9788-
 9789-	// sum up the profile from all the splits
 9790-	for (i = 0; i < STBIR__ARRAY_SIZE(descriptions); i++) {
 9791-		stbir_uint64 sum = 0;
 9792-		for (s = 0; s < split_count; s++) {
 9793-			sum += split_info[s].profile.array[i + 1];
 9794-		}
 9795-		info->clocks[i] = sum;
 9796-	}
 9797-
 9798-	info->total_clocks = split_info->profile.named.total;
 9799-	info->descriptions = descriptions;
 9800-	info->count = STBIR__ARRAY_SIZE(descriptions);
 9801-}
 9802-
 9803-STBIRDEF void
 9804-stbir_resize_extended_profile_info(STBIR_PROFILE_INFO *info,
 9805-                                   STBIR_RESIZE const *resize)
 9806-{
 9807-	stbir_resize_split_profile_info(info, resize, -1, 0);
 9808-}
 9809-
 9810-#endif // STBIR_PROFILE
 9811-
 9812-#undef STBIR_BGR
 9813-#undef STBIR_1CHANNEL
 9814-#undef STBIR_2CHANNEL
 9815-#undef STBIR_RGB
 9816-#undef STBIR_RGBA
 9817-#undef STBIR_4CHANNEL
 9818-#undef STBIR_BGRA
 9819-#undef STBIR_ARGB
 9820-#undef STBIR_ABGR
 9821-#undef STBIR_RA
 9822-#undef STBIR_AR
 9823-#undef STBIR_RGBA_PM
 9824-#undef STBIR_BGRA_PM
 9825-#undef STBIR_ARGB_PM
 9826-#undef STBIR_ABGR_PM
 9827-#undef STBIR_RA_PM
 9828-#undef STBIR_AR_PM
 9829-
 9830-#endif // STB_IMAGE_RESIZE_IMPLEMENTATION
 9831-
 9832-#else // STB_IMAGE_RESIZE_HORIZONTALS&STB_IMAGE_RESIZE_DO_VERTICALS
 9833-
 9834-// we reinclude the header file to define all the horizontal functions
 9835-//   specializing each function for the number of coeffs is 20-40% faster
 9836-//   *OVERALL*
 9837-
 9838-// by including the header file again this way, we can still debug the functions
 9839-
 9840-#define STBIR_strs_join2(start, mid, end) start##mid##end
 9841-#define STBIR_strs_join1(start, mid, end) STBIR_strs_join2(start, mid, end)
 9842-
 9843-#define STBIR_strs_join24(start, mid1, mid2, end) start##mid1##mid2##end
 9844-#define STBIR_strs_join14(start, mid1, mid2, end)                              \
 9845-	STBIR_strs_join24(start, mid1, mid2, end)
 9846-
 9847-#ifdef STB_IMAGE_RESIZE_DO_CODERS
 9848-
 9849-#ifdef stbir__decode_suffix
 9850-#define STBIR__CODER_NAME(name) STBIR_strs_join1(name, _, stbir__decode_suffix)
 9851-#else
 9852-#define STBIR__CODER_NAME(name) name
 9853-#endif
 9854-
 9855-#ifdef stbir__decode_swizzle
 9856-#define stbir__decode_simdf8_flip(reg)                                         \
 9857-	STBIR_strs_join1(                                                          \
 9858-	    STBIR_strs_join1(                                                      \
 9859-	        STBIR_strs_join1(STBIR_strs_join1(stbir__simdf8_0123to,            \
 9860-	                                          stbir__decode_order0,            \
 9861-	                                          stbir__decode_order1),           \
 9862-	                         stbir__decode_order2, stbir__decode_order3),      \
 9863-	        stbir__decode_order0, stbir__decode_order1),                       \
 9864-	    stbir__decode_order2, stbir__decode_order3)(reg, reg)
 9865-#define stbir__decode_simdf4_flip(reg)                                         \
 9866-	STBIR_strs_join1(STBIR_strs_join1(stbir__simdf_0123to,                     \
 9867-	                                  stbir__decode_order0,                    \
 9868-	                                  stbir__decode_order1),                   \
 9869-	                 stbir__decode_order2, stbir__decode_order3)(reg, reg)
 9870-#define stbir__encode_simdf8_unflip(reg)                                       \
 9871-	STBIR_strs_join1(                                                          \
 9872-	    STBIR_strs_join1(                                                      \
 9873-	        STBIR_strs_join1(STBIR_strs_join1(stbir__simdf8_0123to,            \
 9874-	                                          stbir__encode_order0,            \
 9875-	                                          stbir__encode_order1),           \
 9876-	                         stbir__encode_order2, stbir__encode_order3),      \
 9877-	        stbir__encode_order0, stbir__encode_order1),                       \
 9878-	    stbir__encode_order2, stbir__encode_order3)(reg, reg)
 9879-#define stbir__encode_simdf4_unflip(reg)                                       \
 9880-	STBIR_strs_join1(STBIR_strs_join1(stbir__simdf_0123to,                     \
 9881-	                                  stbir__encode_order0,                    \
 9882-	                                  stbir__encode_order1),                   \
 9883-	                 stbir__encode_order2, stbir__encode_order3)(reg, reg)
 9884-#else
 9885-#define stbir__decode_order0 0
 9886-#define stbir__decode_order1 1
 9887-#define stbir__decode_order2 2
 9888-#define stbir__decode_order3 3
 9889-#define stbir__encode_order0 0
 9890-#define stbir__encode_order1 1
 9891-#define stbir__encode_order2 2
 9892-#define stbir__encode_order3 3
 9893-#define stbir__decode_simdf8_flip(reg)
 9894-#define stbir__decode_simdf4_flip(reg)
 9895-#define stbir__encode_simdf8_unflip(reg)
 9896-#define stbir__encode_simdf4_unflip(reg)
 9897-#endif
 9898-
 9899-#ifdef STBIR_SIMD8
 9900-#define stbir__encode_simdfX_unflip stbir__encode_simdf8_unflip
 9901-#else
 9902-#define stbir__encode_simdfX_unflip stbir__encode_simdf4_unflip
 9903-#endif
 9904-
 9905-static float *
 9906-STBIR__CODER_NAME(stbir__decode_uint8_linear_scaled)(float *decodep,
 9907-                                                     int width_times_channels,
 9908-                                                     void const *inputp)
 9909-{
 9910-	float STBIR_STREAMOUT_PTR(*) decode = decodep;
 9911-	float *decode_end = (float *)decode + width_times_channels;
 9912-	unsigned char const *input = (unsigned char const *)inputp;
 9913-
 9914-#ifdef STBIR_SIMD
 9915-	unsigned char const *end_input_m16 = input + width_times_channels - 16;
 9916-	if (width_times_channels >= 16) {
 9917-		decode_end -= 16;
 9918-		STBIR_NO_UNROLL_LOOP_START_INF_FOR
 9919-		for (;;) {
 9920-#ifdef STBIR_SIMD8
 9921-			stbir__simdi i;
 9922-			stbir__simdi8 o0, o1;
 9923-			stbir__simdf8 of0, of1;
 9924-			STBIR_NO_UNROLL(decode);
 9925-			stbir__simdi_load(i, input);
 9926-			stbir__simdi8_expand_u8_to_u32(o0, o1, i);
 9927-			stbir__simdi8_convert_i32_to_float(of0, o0);
 9928-			stbir__simdi8_convert_i32_to_float(of1, o1);
 9929-			stbir__simdf8_mult(of0, of0, STBIR_max_uint8_as_float_inverted8);
 9930-			stbir__simdf8_mult(of1, of1, STBIR_max_uint8_as_float_inverted8);
 9931-			stbir__decode_simdf8_flip(of0);
 9932-			stbir__decode_simdf8_flip(of1);
 9933-			stbir__simdf8_store(decode + 0, of0);
 9934-			stbir__simdf8_store(decode + 8, of1);
 9935-#else
 9936-			stbir__simdi i, o0, o1, o2, o3;
 9937-			stbir__simdf of0, of1, of2, of3;
 9938-			STBIR_NO_UNROLL(decode);
 9939-			stbir__simdi_load(i, input);
 9940-			stbir__simdi_expand_u8_to_u32(o0, o1, o2, o3, i);
 9941-			stbir__simdi_convert_i32_to_float(of0, o0);
 9942-			stbir__simdi_convert_i32_to_float(of1, o1);
 9943-			stbir__simdi_convert_i32_to_float(of2, o2);
 9944-			stbir__simdi_convert_i32_to_float(of3, o3);
 9945-			stbir__simdf_mult(of0, of0,
 9946-			                  STBIR__CONSTF(STBIR_max_uint8_as_float_inverted));
 9947-			stbir__simdf_mult(of1, of1,
 9948-			                  STBIR__CONSTF(STBIR_max_uint8_as_float_inverted));
 9949-			stbir__simdf_mult(of2, of2,
 9950-			                  STBIR__CONSTF(STBIR_max_uint8_as_float_inverted));
 9951-			stbir__simdf_mult(of3, of3,
 9952-			                  STBIR__CONSTF(STBIR_max_uint8_as_float_inverted));
 9953-			stbir__decode_simdf4_flip(of0);
 9954-			stbir__decode_simdf4_flip(of1);
 9955-			stbir__decode_simdf4_flip(of2);
 9956-			stbir__decode_simdf4_flip(of3);
 9957-			stbir__simdf_store(decode + 0, of0);
 9958-			stbir__simdf_store(decode + 4, of1);
 9959-			stbir__simdf_store(decode + 8, of2);
 9960-			stbir__simdf_store(decode + 12, of3);
 9961-#endif
 9962-			decode += 16;
 9963-			input += 16;
 9964-			if (decode <= decode_end) {
 9965-				continue;
 9966-			}
 9967-			if (decode == (decode_end + 16)) {
 9968-				break;
 9969-			}
 9970-			decode = decode_end; // backup and do last couple
 9971-			input = end_input_m16;
 9972-		}
 9973-		return decode_end + 16;
 9974-	}
 9975-#endif
 9976-
 9977-// try to do blocks of 4 when you can
 9978-#if stbir__coder_min_num != 3 // doesn't divide cleanly by four
 9979-	decode += 4;
 9980-	STBIR_SIMD_NO_UNROLL_LOOP_START
 9981-	while (decode <= decode_end) {
 9982-		STBIR_SIMD_NO_UNROLL(decode);
 9983-		decode[0 - 4] = ((float)(input[stbir__decode_order0])) *
 9984-		                stbir__max_uint8_as_float_inverted;
 9985-		decode[1 - 4] = ((float)(input[stbir__decode_order1])) *
 9986-		                stbir__max_uint8_as_float_inverted;
 9987-		decode[2 - 4] = ((float)(input[stbir__decode_order2])) *
 9988-		                stbir__max_uint8_as_float_inverted;
 9989-		decode[3 - 4] = ((float)(input[stbir__decode_order3])) *
 9990-		                stbir__max_uint8_as_float_inverted;
 9991-		decode += 4;
 9992-		input += 4;
 9993-	}
 9994-	decode -= 4;
 9995-#endif
 9996-
 9997-// do the remnants
 9998-#if stbir__coder_min_num < 4
 9999-	STBIR_NO_UNROLL_LOOP_START
10000-	while (decode < decode_end) {
10001-		STBIR_NO_UNROLL(decode);
10002-		decode[0] = ((float)(input[stbir__decode_order0])) *
10003-		            stbir__max_uint8_as_float_inverted;
10004-#if stbir__coder_min_num >= 2
10005-		decode[1] = ((float)(input[stbir__decode_order1])) *
10006-		            stbir__max_uint8_as_float_inverted;
10007-#endif
10008-#if stbir__coder_min_num >= 3
10009-		decode[2] = ((float)(input[stbir__decode_order2])) *
10010-		            stbir__max_uint8_as_float_inverted;
10011-#endif
10012-		decode += stbir__coder_min_num;
10013-		input += stbir__coder_min_num;
10014-	}
10015-#endif
10016-
10017-	return decode_end;
10018-}
10019-
10020-static void
10021-STBIR__CODER_NAME(stbir__encode_uint8_linear_scaled)(void *outputp,
10022-                                                     int width_times_channels,
10023-                                                     float const *encode)
10024-{
10025-	unsigned char STBIR_SIMD_STREAMOUT_PTR(*) output = (unsigned char *)outputp;
10026-	unsigned char *end_output =
10027-	    ((unsigned char *)output) + width_times_channels;
10028-
10029-#ifdef STBIR_SIMD
10030-	if (width_times_channels >= stbir__simdfX_float_count * 2) {
10031-		float const *end_encode_m8 =
10032-		    encode + width_times_channels - stbir__simdfX_float_count * 2;
10033-		end_output -= stbir__simdfX_float_count * 2;
10034-		STBIR_NO_UNROLL_LOOP_START_INF_FOR
10035-		for (;;) {
10036-			stbir__simdfX e0, e1;
10037-			stbir__simdi i;
10038-			STBIR_SIMD_NO_UNROLL(encode);
10039-			stbir__simdfX_madd_mem(e0, STBIR_simd_point5X,
10040-			                       STBIR_max_uint8_as_floatX, encode);
10041-			stbir__simdfX_madd_mem(e1, STBIR_simd_point5X,
10042-			                       STBIR_max_uint8_as_floatX,
10043-			                       encode + stbir__simdfX_float_count);
10044-			stbir__encode_simdfX_unflip(e0);
10045-			stbir__encode_simdfX_unflip(e1);
10046-#ifdef STBIR_SIMD8
10047-			stbir__simdf8_pack_to_16bytes(i, e0, e1);
10048-			stbir__simdi_store(output, i);
10049-#else
10050-			stbir__simdf_pack_to_8bytes(i, e0, e1);
10051-			stbir__simdi_store2(output, i);
10052-#endif
10053-			encode += stbir__simdfX_float_count * 2;
10054-			output += stbir__simdfX_float_count * 2;
10055-			if (output <= end_output) {
10056-				continue;
10057-			}
10058-			if (output == (end_output + stbir__simdfX_float_count * 2)) {
10059-				break;
10060-			}
10061-			output = end_output; // backup and do last couple
10062-			encode = end_encode_m8;
10063-		}
10064-		return;
10065-	}
10066-
10067-// try to do blocks of 4 when you can
10068-#if stbir__coder_min_num != 3 // doesn't divide cleanly by four
10069-	output += 4;
10070-	STBIR_NO_UNROLL_LOOP_START
10071-	while (output <= end_output) {
10072-		stbir__simdf e0;
10073-		stbir__simdi i0;
10074-		STBIR_NO_UNROLL(encode);
10075-		stbir__simdf_load(e0, encode);
10076-		stbir__simdf_madd(e0, STBIR__CONSTF(STBIR_simd_point5),
10077-		                  STBIR__CONSTF(STBIR_max_uint8_as_float), e0);
10078-		stbir__encode_simdf4_unflip(e0);
10079-		stbir__simdf_pack_to_8bytes(i0, e0, e0); // only use first 4
10080-		*(int *)(output - 4) = stbir__simdi_to_int(i0);
10081-		output += 4;
10082-		encode += 4;
10083-	}
10084-	output -= 4;
10085-#endif
10086-
10087-// do the remnants
10088-#if stbir__coder_min_num < 4
10089-	STBIR_NO_UNROLL_LOOP_START
10090-	while (output < end_output) {
10091-		stbir__simdf e0;
10092-		STBIR_NO_UNROLL(encode);
10093-		stbir__simdf_madd1_mem(e0, STBIR__CONSTF(STBIR_simd_point5),
10094-		                       STBIR__CONSTF(STBIR_max_uint8_as_float),
10095-		                       encode + stbir__encode_order0);
10096-		output[0] = stbir__simdf_convert_float_to_uint8(e0);
10097-#if stbir__coder_min_num >= 2
10098-		stbir__simdf_madd1_mem(e0, STBIR__CONSTF(STBIR_simd_point5),
10099-		                       STBIR__CONSTF(STBIR_max_uint8_as_float),
10100-		                       encode + stbir__encode_order1);
10101-		output[1] = stbir__simdf_convert_float_to_uint8(e0);
10102-#endif
10103-#if stbir__coder_min_num >= 3
10104-		stbir__simdf_madd1_mem(e0, STBIR__CONSTF(STBIR_simd_point5),
10105-		                       STBIR__CONSTF(STBIR_max_uint8_as_float),
10106-		                       encode + stbir__encode_order2);
10107-		output[2] = stbir__simdf_convert_float_to_uint8(e0);
10108-#endif
10109-		output += stbir__coder_min_num;
10110-		encode += stbir__coder_min_num;
10111-	}
10112-#endif
10113-
10114-#else
10115-
10116-// try to do blocks of 4 when you can
10117-#if stbir__coder_min_num != 3 // doesn't divide cleanly by four
10118-	output += 4;
10119-	while (output <= end_output) {
10120-		float f;
10121-		f = encode[stbir__encode_order0] * stbir__max_uint8_as_float + 0.5f;
10122-		STBIR_CLAMP(f, 0, 255);
10123-		output[0 - 4] = (unsigned char)f;
10124-		f = encode[stbir__encode_order1] * stbir__max_uint8_as_float + 0.5f;
10125-		STBIR_CLAMP(f, 0, 255);
10126-		output[1 - 4] = (unsigned char)f;
10127-		f = encode[stbir__encode_order2] * stbir__max_uint8_as_float + 0.5f;
10128-		STBIR_CLAMP(f, 0, 255);
10129-		output[2 - 4] = (unsigned char)f;
10130-		f = encode[stbir__encode_order3] * stbir__max_uint8_as_float + 0.5f;
10131-		STBIR_CLAMP(f, 0, 255);
10132-		output[3 - 4] = (unsigned char)f;
10133-		output += 4;
10134-		encode += 4;
10135-	}
10136-	output -= 4;
10137-#endif
10138-
10139-// do the remnants
10140-#if stbir__coder_min_num < 4
10141-	STBIR_NO_UNROLL_LOOP_START
10142-	while (output < end_output) {
10143-		float f;
10144-		STBIR_NO_UNROLL(encode);
10145-		f = encode[stbir__encode_order0] * stbir__max_uint8_as_float + 0.5f;
10146-		STBIR_CLAMP(f, 0, 255);
10147-		output[0] = (unsigned char)f;
10148-#if stbir__coder_min_num >= 2
10149-		f = encode[stbir__encode_order1] * stbir__max_uint8_as_float + 0.5f;
10150-		STBIR_CLAMP(f, 0, 255);
10151-		output[1] = (unsigned char)f;
10152-#endif
10153-#if stbir__coder_min_num >= 3
10154-		f = encode[stbir__encode_order2] * stbir__max_uint8_as_float + 0.5f;
10155-		STBIR_CLAMP(f, 0, 255);
10156-		output[2] = (unsigned char)f;
10157-#endif
10158-		output += stbir__coder_min_num;
10159-		encode += stbir__coder_min_num;
10160-	}
10161-#endif
10162-#endif
10163-}
10164-
10165-static float *
10166-STBIR__CODER_NAME(stbir__decode_uint8_linear)(float *decodep,
10167-                                              int width_times_channels,
10168-                                              void const *inputp)
10169-{
10170-	float STBIR_STREAMOUT_PTR(*) decode = decodep;
10171-	float *decode_end = (float *)decode + width_times_channels;
10172-	unsigned char const *input = (unsigned char const *)inputp;
10173-
10174-#ifdef STBIR_SIMD
10175-	unsigned char const *end_input_m16 = input + width_times_channels - 16;
10176-	if (width_times_channels >= 16) {
10177-		decode_end -= 16;
10178-		STBIR_NO_UNROLL_LOOP_START_INF_FOR
10179-		for (;;) {
10180-#ifdef STBIR_SIMD8
10181-			stbir__simdi i;
10182-			stbir__simdi8 o0, o1;
10183-			stbir__simdf8 of0, of1;
10184-			STBIR_NO_UNROLL(decode);
10185-			stbir__simdi_load(i, input);
10186-			stbir__simdi8_expand_u8_to_u32(o0, o1, i);
10187-			stbir__simdi8_convert_i32_to_float(of0, o0);
10188-			stbir__simdi8_convert_i32_to_float(of1, o1);
10189-			stbir__decode_simdf8_flip(of0);
10190-			stbir__decode_simdf8_flip(of1);
10191-			stbir__simdf8_store(decode + 0, of0);
10192-			stbir__simdf8_store(decode + 8, of1);
10193-#else
10194-			stbir__simdi i, o0, o1, o2, o3;
10195-			stbir__simdf of0, of1, of2, of3;
10196-			STBIR_NO_UNROLL(decode);
10197-			stbir__simdi_load(i, input);
10198-			stbir__simdi_expand_u8_to_u32(o0, o1, o2, o3, i);
10199-			stbir__simdi_convert_i32_to_float(of0, o0);
10200-			stbir__simdi_convert_i32_to_float(of1, o1);
10201-			stbir__simdi_convert_i32_to_float(of2, o2);
10202-			stbir__simdi_convert_i32_to_float(of3, o3);
10203-			stbir__decode_simdf4_flip(of0);
10204-			stbir__decode_simdf4_flip(of1);
10205-			stbir__decode_simdf4_flip(of2);
10206-			stbir__decode_simdf4_flip(of3);
10207-			stbir__simdf_store(decode + 0, of0);
10208-			stbir__simdf_store(decode + 4, of1);
10209-			stbir__simdf_store(decode + 8, of2);
10210-			stbir__simdf_store(decode + 12, of3);
10211-#endif
10212-			decode += 16;
10213-			input += 16;
10214-			if (decode <= decode_end) {
10215-				continue;
10216-			}
10217-			if (decode == (decode_end + 16)) {
10218-				break;
10219-			}
10220-			decode = decode_end; // backup and do last couple
10221-			input = end_input_m16;
10222-		}
10223-		return decode_end + 16;
10224-	}
10225-#endif
10226-
10227-// try to do blocks of 4 when you can
10228-#if stbir__coder_min_num != 3 // doesn't divide cleanly by four
10229-	decode += 4;
10230-	STBIR_SIMD_NO_UNROLL_LOOP_START
10231-	while (decode <= decode_end) {
10232-		STBIR_SIMD_NO_UNROLL(decode);
10233-		decode[0 - 4] = ((float)(input[stbir__decode_order0]));
10234-		decode[1 - 4] = ((float)(input[stbir__decode_order1]));
10235-		decode[2 - 4] = ((float)(input[stbir__decode_order2]));
10236-		decode[3 - 4] = ((float)(input[stbir__decode_order3]));
10237-		decode += 4;
10238-		input += 4;
10239-	}
10240-	decode -= 4;
10241-#endif
10242-
10243-// do the remnants
10244-#if stbir__coder_min_num < 4
10245-	STBIR_NO_UNROLL_LOOP_START
10246-	while (decode < decode_end) {
10247-		STBIR_NO_UNROLL(decode);
10248-		decode[0] = ((float)(input[stbir__decode_order0]));
10249-#if stbir__coder_min_num >= 2
10250-		decode[1] = ((float)(input[stbir__decode_order1]));
10251-#endif
10252-#if stbir__coder_min_num >= 3
10253-		decode[2] = ((float)(input[stbir__decode_order2]));
10254-#endif
10255-		decode += stbir__coder_min_num;
10256-		input += stbir__coder_min_num;
10257-	}
10258-#endif
10259-	return decode_end;
10260-}
10261-
10262-static void
10263-STBIR__CODER_NAME(stbir__encode_uint8_linear)(void *outputp,
10264-                                              int width_times_channels,
10265-                                              float const *encode)
10266-{
10267-	unsigned char STBIR_SIMD_STREAMOUT_PTR(*) output = (unsigned char *)outputp;
10268-	unsigned char *end_output =
10269-	    ((unsigned char *)output) + width_times_channels;
10270-
10271-#ifdef STBIR_SIMD
10272-	if (width_times_channels >= stbir__simdfX_float_count * 2) {
10273-		float const *end_encode_m8 =
10274-		    encode + width_times_channels - stbir__simdfX_float_count * 2;
10275-		end_output -= stbir__simdfX_float_count * 2;
10276-		STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
10277-		for (;;) {
10278-			stbir__simdfX e0, e1;
10279-			stbir__simdi i;
10280-			STBIR_SIMD_NO_UNROLL(encode);
10281-			stbir__simdfX_add_mem(e0, STBIR_simd_point5X, encode);
10282-			stbir__simdfX_add_mem(e1, STBIR_simd_point5X,
10283-			                      encode + stbir__simdfX_float_count);
10284-			stbir__encode_simdfX_unflip(e0);
10285-			stbir__encode_simdfX_unflip(e1);
10286-#ifdef STBIR_SIMD8
10287-			stbir__simdf8_pack_to_16bytes(i, e0, e1);
10288-			stbir__simdi_store(output, i);
10289-#else
10290-			stbir__simdf_pack_to_8bytes(i, e0, e1);
10291-			stbir__simdi_store2(output, i);
10292-#endif
10293-			encode += stbir__simdfX_float_count * 2;
10294-			output += stbir__simdfX_float_count * 2;
10295-			if (output <= end_output) {
10296-				continue;
10297-			}
10298-			if (output == (end_output + stbir__simdfX_float_count * 2)) {
10299-				break;
10300-			}
10301-			output = end_output; // backup and do last couple
10302-			encode = end_encode_m8;
10303-		}
10304-		return;
10305-	}
10306-
10307-// try to do blocks of 4 when you can
10308-#if stbir__coder_min_num != 3 // doesn't divide cleanly by four
10309-	output += 4;
10310-	STBIR_NO_UNROLL_LOOP_START
10311-	while (output <= end_output) {
10312-		stbir__simdf e0;
10313-		stbir__simdi i0;
10314-		STBIR_NO_UNROLL(encode);
10315-		stbir__simdf_load(e0, encode);
10316-		stbir__simdf_add(e0, STBIR__CONSTF(STBIR_simd_point5), e0);
10317-		stbir__encode_simdf4_unflip(e0);
10318-		stbir__simdf_pack_to_8bytes(i0, e0, e0); // only use first 4
10319-		*(int *)(output - 4) = stbir__simdi_to_int(i0);
10320-		output += 4;
10321-		encode += 4;
10322-	}
10323-	output -= 4;
10324-#endif
10325-
10326-#else
10327-
10328-// try to do blocks of 4 when you can
10329-#if stbir__coder_min_num != 3 // doesn't divide cleanly by four
10330-	output += 4;
10331-	while (output <= end_output) {
10332-		float f;
10333-		f = encode[stbir__encode_order0] + 0.5f;
10334-		STBIR_CLAMP(f, 0, 255);
10335-		output[0 - 4] = (unsigned char)f;
10336-		f = encode[stbir__encode_order1] + 0.5f;
10337-		STBIR_CLAMP(f, 0, 255);
10338-		output[1 - 4] = (unsigned char)f;
10339-		f = encode[stbir__encode_order2] + 0.5f;
10340-		STBIR_CLAMP(f, 0, 255);
10341-		output[2 - 4] = (unsigned char)f;
10342-		f = encode[stbir__encode_order3] + 0.5f;
10343-		STBIR_CLAMP(f, 0, 255);
10344-		output[3 - 4] = (unsigned char)f;
10345-		output += 4;
10346-		encode += 4;
10347-	}
10348-	output -= 4;
10349-#endif
10350-
10351-#endif
10352-
10353-// do the remnants
10354-#if stbir__coder_min_num < 4
10355-	STBIR_NO_UNROLL_LOOP_START
10356-	while (output < end_output) {
10357-		float f;
10358-		STBIR_NO_UNROLL(encode);
10359-		f = encode[stbir__encode_order0] + 0.5f;
10360-		STBIR_CLAMP(f, 0, 255);
10361-		output[0] = (unsigned char)f;
10362-#if stbir__coder_min_num >= 2
10363-		f = encode[stbir__encode_order1] + 0.5f;
10364-		STBIR_CLAMP(f, 0, 255);
10365-		output[1] = (unsigned char)f;
10366-#endif
10367-#if stbir__coder_min_num >= 3
10368-		f = encode[stbir__encode_order2] + 0.5f;
10369-		STBIR_CLAMP(f, 0, 255);
10370-		output[2] = (unsigned char)f;
10371-#endif
10372-		output += stbir__coder_min_num;
10373-		encode += stbir__coder_min_num;
10374-	}
10375-#endif
10376-}
10377-
10378-static float *
10379-STBIR__CODER_NAME(stbir__decode_uint8_srgb)(float *decodep,
10380-                                            int width_times_channels,
10381-                                            void const *inputp)
10382-{
10383-	float STBIR_STREAMOUT_PTR(*) decode = decodep;
10384-	float *decode_end = (float *)decode + width_times_channels;
10385-	unsigned char const *input = (unsigned char const *)inputp;
10386-
10387-// try to do blocks of 4 when you can
10388-#if stbir__coder_min_num != 3 // doesn't divide cleanly by four
10389-	decode += 4;
10390-	while (decode <= decode_end) {
10391-		decode[0 - 4] =
10392-		    stbir__srgb_uchar_to_linear_float[input[stbir__decode_order0]];
10393-		decode[1 - 4] =
10394-		    stbir__srgb_uchar_to_linear_float[input[stbir__decode_order1]];
10395-		decode[2 - 4] =
10396-		    stbir__srgb_uchar_to_linear_float[input[stbir__decode_order2]];
10397-		decode[3 - 4] =
10398-		    stbir__srgb_uchar_to_linear_float[input[stbir__decode_order3]];
10399-		decode += 4;
10400-		input += 4;
10401-	}
10402-	decode -= 4;
10403-#endif
10404-
10405-// do the remnants
10406-#if stbir__coder_min_num < 4
10407-	STBIR_NO_UNROLL_LOOP_START
10408-	while (decode < decode_end) {
10409-		STBIR_NO_UNROLL(decode);
10410-		decode[0] =
10411-		    stbir__srgb_uchar_to_linear_float[input[stbir__decode_order0]];
10412-#if stbir__coder_min_num >= 2
10413-		decode[1] =
10414-		    stbir__srgb_uchar_to_linear_float[input[stbir__decode_order1]];
10415-#endif
10416-#if stbir__coder_min_num >= 3
10417-		decode[2] =
10418-		    stbir__srgb_uchar_to_linear_float[input[stbir__decode_order2]];
10419-#endif
10420-		decode += stbir__coder_min_num;
10421-		input += stbir__coder_min_num;
10422-	}
10423-#endif
10424-	return decode_end;
10425-}
10426-
10427-#define stbir__min_max_shift20(i, f)                                           \
10428-	stbir__simdf_max(f, f,                                                     \
10429-	                 stbir_simdf_casti(STBIR__CONSTI(STBIR_almost_zero)));     \
10430-	stbir__simdf_min(f, f,                                                     \
10431-	                 stbir_simdf_casti(STBIR__CONSTI(STBIR_almost_one)));      \
10432-	stbir__simdi_32shr(i, stbir_simdi_castf(f), 20);
10433-
10434-#define stbir__scale_and_convert(i, f)                                         \
10435-	stbir__simdf_madd(f, STBIR__CONSTF(STBIR_simd_point5),                     \
10436-	                  STBIR__CONSTF(STBIR_max_uint8_as_float), f);             \
10437-	stbir__simdf_max(f, f, stbir__simdf_zeroP());                              \
10438-	stbir__simdf_min(f, f, STBIR__CONSTF(STBIR_max_uint8_as_float));           \
10439-	stbir__simdf_convert_float_to_i32(i, f);
10440-
10441-#define stbir__linear_to_srgb_finish(i, f)                                     \
10442-	{                                                                          \
10443-		stbir__simdi temp;                                                     \
10444-		stbir__simdi_32shr(temp, stbir_simdi_castf(f), 12);                    \
10445-		stbir__simdi_and(temp, temp, STBIR__CONSTI(STBIR_mastissa_mask));      \
10446-		stbir__simdi_or(temp, temp, STBIR__CONSTI(STBIR_topscale));            \
10447-		stbir__simdi_16madd(i, i, temp);                                       \
10448-		stbir__simdi_32shr(i, i, 16);                                          \
10449-	}
10450-
10451-#define stbir__simdi_table_lookup2(v0, v1, table)                              \
10452-	{                                                                          \
10453-		stbir__simdi_u32 temp0, temp1;                                         \
10454-		temp0.m128i_i128 = v0;                                                 \
10455-		temp1.m128i_i128 = v1;                                                 \
10456-		temp0.m128i_u32[0] = table[temp0.m128i_i32[0]];                        \
10457-		temp0.m128i_u32[1] = table[temp0.m128i_i32[1]];                        \
10458-		temp0.m128i_u32[2] = table[temp0.m128i_i32[2]];                        \
10459-		temp0.m128i_u32[3] = table[temp0.m128i_i32[3]];                        \
10460-		temp1.m128i_u32[0] = table[temp1.m128i_i32[0]];                        \
10461-		temp1.m128i_u32[1] = table[temp1.m128i_i32[1]];                        \
10462-		temp1.m128i_u32[2] = table[temp1.m128i_i32[2]];                        \
10463-		temp1.m128i_u32[3] = table[temp1.m128i_i32[3]];                        \
10464-		v0 = temp0.m128i_i128;                                                 \
10465-		v1 = temp1.m128i_i128;                                                 \
10466-	}
10467-
10468-#define stbir__simdi_table_lookup3(v0, v1, v2, table)                          \
10469-	{                                                                          \
10470-		stbir__simdi_u32 temp0, temp1, temp2;                                  \
10471-		temp0.m128i_i128 = v0;                                                 \
10472-		temp1.m128i_i128 = v1;                                                 \
10473-		temp2.m128i_i128 = v2;                                                 \
10474-		temp0.m128i_u32[0] = table[temp0.m128i_i32[0]];                        \
10475-		temp0.m128i_u32[1] = table[temp0.m128i_i32[1]];                        \
10476-		temp0.m128i_u32[2] = table[temp0.m128i_i32[2]];                        \
10477-		temp0.m128i_u32[3] = table[temp0.m128i_i32[3]];                        \
10478-		temp1.m128i_u32[0] = table[temp1.m128i_i32[0]];                        \
10479-		temp1.m128i_u32[1] = table[temp1.m128i_i32[1]];                        \
10480-		temp1.m128i_u32[2] = table[temp1.m128i_i32[2]];                        \
10481-		temp1.m128i_u32[3] = table[temp1.m128i_i32[3]];                        \
10482-		temp2.m128i_u32[0] = table[temp2.m128i_i32[0]];                        \
10483-		temp2.m128i_u32[1] = table[temp2.m128i_i32[1]];                        \
10484-		temp2.m128i_u32[2] = table[temp2.m128i_i32[2]];                        \
10485-		temp2.m128i_u32[3] = table[temp2.m128i_i32[3]];                        \
10486-		v0 = temp0.m128i_i128;                                                 \
10487-		v1 = temp1.m128i_i128;                                                 \
10488-		v2 = temp2.m128i_i128;                                                 \
10489-	}
10490-
10491-#define stbir__simdi_table_lookup4(v0, v1, v2, v3, table)                      \
10492-	{                                                                          \
10493-		stbir__simdi_u32 temp0, temp1, temp2, temp3;                           \
10494-		temp0.m128i_i128 = v0;                                                 \
10495-		temp1.m128i_i128 = v1;                                                 \
10496-		temp2.m128i_i128 = v2;                                                 \
10497-		temp3.m128i_i128 = v3;                                                 \
10498-		temp0.m128i_u32[0] = table[temp0.m128i_i32[0]];                        \
10499-		temp0.m128i_u32[1] = table[temp0.m128i_i32[1]];                        \
10500-		temp0.m128i_u32[2] = table[temp0.m128i_i32[2]];                        \
10501-		temp0.m128i_u32[3] = table[temp0.m128i_i32[3]];                        \
10502-		temp1.m128i_u32[0] = table[temp1.m128i_i32[0]];                        \
10503-		temp1.m128i_u32[1] = table[temp1.m128i_i32[1]];                        \
10504-		temp1.m128i_u32[2] = table[temp1.m128i_i32[2]];                        \
10505-		temp1.m128i_u32[3] = table[temp1.m128i_i32[3]];                        \
10506-		temp2.m128i_u32[0] = table[temp2.m128i_i32[0]];                        \
10507-		temp2.m128i_u32[1] = table[temp2.m128i_i32[1]];                        \
10508-		temp2.m128i_u32[2] = table[temp2.m128i_i32[2]];                        \
10509-		temp2.m128i_u32[3] = table[temp2.m128i_i32[3]];                        \
10510-		temp3.m128i_u32[0] = table[temp3.m128i_i32[0]];                        \
10511-		temp3.m128i_u32[1] = table[temp3.m128i_i32[1]];                        \
10512-		temp3.m128i_u32[2] = table[temp3.m128i_i32[2]];                        \
10513-		temp3.m128i_u32[3] = table[temp3.m128i_i32[3]];                        \
10514-		v0 = temp0.m128i_i128;                                                 \
10515-		v1 = temp1.m128i_i128;                                                 \
10516-		v2 = temp2.m128i_i128;                                                 \
10517-		v3 = temp3.m128i_i128;                                                 \
10518-	}
10519-
10520-static void
10521-STBIR__CODER_NAME(stbir__encode_uint8_srgb)(void *outputp,
10522-                                            int width_times_channels,
10523-                                            float const *encode)
10524-{
10525-	unsigned char STBIR_SIMD_STREAMOUT_PTR(*) output = (unsigned char *)outputp;
10526-	unsigned char *end_output =
10527-	    ((unsigned char *)output) + width_times_channels;
10528-
10529-#ifdef STBIR_SIMD
10530-
10531-	if (width_times_channels >= 16) {
10532-		float const *end_encode_m16 = encode + width_times_channels - 16;
10533-		end_output -= 16;
10534-		STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
10535-		for (;;) {
10536-			stbir__simdf f0, f1, f2, f3;
10537-			stbir__simdi i0, i1, i2, i3;
10538-			STBIR_SIMD_NO_UNROLL(encode);
10539-
10540-			stbir__simdf_load4_transposed(f0, f1, f2, f3, encode);
10541-
10542-			stbir__min_max_shift20(i0, f0);
10543-			stbir__min_max_shift20(i1, f1);
10544-			stbir__min_max_shift20(i2, f2);
10545-			stbir__min_max_shift20(i3, f3);
10546-
10547-			stbir__simdi_table_lookup4(i0, i1, i2, i3,
10548-			                           (fp32_to_srgb8_tab4 - (127 - 13) * 8));
10549-
10550-			stbir__linear_to_srgb_finish(i0, f0);
10551-			stbir__linear_to_srgb_finish(i1, f1);
10552-			stbir__linear_to_srgb_finish(i2, f2);
10553-			stbir__linear_to_srgb_finish(i3, f3);
10554-
10555-			stbir__interleave_pack_and_store_16_u8(
10556-			    output, STBIR_strs_join1(i, , stbir__encode_order0),
10557-			    STBIR_strs_join1(i, , stbir__encode_order1),
10558-			    STBIR_strs_join1(i, , stbir__encode_order2),
10559-			    STBIR_strs_join1(i, , stbir__encode_order3));
10560-
10561-			encode += 16;
10562-			output += 16;
10563-			if (output <= end_output) {
10564-				continue;
10565-			}
10566-			if (output == (end_output + 16)) {
10567-				break;
10568-			}
10569-			output = end_output; // backup and do last couple
10570-			encode = end_encode_m16;
10571-		}
10572-		return;
10573-	}
10574-#endif
10575-
10576-// try to do blocks of 4 when you can
10577-#if stbir__coder_min_num != 3 // doesn't divide cleanly by four
10578-	output += 4;
10579-	STBIR_SIMD_NO_UNROLL_LOOP_START
10580-	while (output <= end_output) {
10581-		STBIR_SIMD_NO_UNROLL(encode);
10582-
10583-		output[0 - 4] =
10584-		    stbir__linear_to_srgb_uchar(encode[stbir__encode_order0]);
10585-		output[1 - 4] =
10586-		    stbir__linear_to_srgb_uchar(encode[stbir__encode_order1]);
10587-		output[2 - 4] =
10588-		    stbir__linear_to_srgb_uchar(encode[stbir__encode_order2]);
10589-		output[3 - 4] =
10590-		    stbir__linear_to_srgb_uchar(encode[stbir__encode_order3]);
10591-
10592-		output += 4;
10593-		encode += 4;
10594-	}
10595-	output -= 4;
10596-#endif
10597-
10598-// do the remnants
10599-#if stbir__coder_min_num < 4
10600-	STBIR_NO_UNROLL_LOOP_START
10601-	while (output < end_output) {
10602-		STBIR_NO_UNROLL(encode);
10603-		output[0] = stbir__linear_to_srgb_uchar(encode[stbir__encode_order0]);
10604-#if stbir__coder_min_num >= 2
10605-		output[1] = stbir__linear_to_srgb_uchar(encode[stbir__encode_order1]);
10606-#endif
10607-#if stbir__coder_min_num >= 3
10608-		output[2] = stbir__linear_to_srgb_uchar(encode[stbir__encode_order2]);
10609-#endif
10610-		output += stbir__coder_min_num;
10611-		encode += stbir__coder_min_num;
10612-	}
10613-#endif
10614-}
10615-
10616-#if (stbir__coder_min_num == 4) ||                                             \
10617-    ((stbir__coder_min_num == 1) && (!defined(stbir__decode_swizzle)))
10618-
10619-static float *
10620-STBIR__CODER_NAME(stbir__decode_uint8_srgb4_linearalpha)(
10621-    float *decodep, int width_times_channels, void const *inputp)
10622-{
10623-	float STBIR_STREAMOUT_PTR(*) decode = decodep;
10624-	float *decode_end = (float *)decode + width_times_channels;
10625-	unsigned char const *input = (unsigned char const *)inputp;
10626-
10627-	do {
10628-		decode[0] =
10629-		    stbir__srgb_uchar_to_linear_float[input[stbir__decode_order0]];
10630-		decode[1] =
10631-		    stbir__srgb_uchar_to_linear_float[input[stbir__decode_order1]];
10632-		decode[2] =
10633-		    stbir__srgb_uchar_to_linear_float[input[stbir__decode_order2]];
10634-		decode[3] = ((float)input[stbir__decode_order3]) *
10635-		            stbir__max_uint8_as_float_inverted;
10636-		input += 4;
10637-		decode += 4;
10638-	} while (decode < decode_end);
10639-	return decode_end;
10640-}
10641-
10642-static void
10643-STBIR__CODER_NAME(stbir__encode_uint8_srgb4_linearalpha)(
10644-    void *outputp, int width_times_channels, float const *encode)
10645-{
10646-	unsigned char STBIR_SIMD_STREAMOUT_PTR(*) output = (unsigned char *)outputp;
10647-	unsigned char *end_output =
10648-	    ((unsigned char *)output) + width_times_channels;
10649-
10650-#ifdef STBIR_SIMD
10651-
10652-	if (width_times_channels >= 16) {
10653-		float const *end_encode_m16 = encode + width_times_channels - 16;
10654-		end_output -= 16;
10655-		STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
10656-		for (;;) {
10657-			stbir__simdf f0, f1, f2, f3;
10658-			stbir__simdi i0, i1, i2, i3;
10659-
10660-			STBIR_SIMD_NO_UNROLL(encode);
10661-			stbir__simdf_load4_transposed(f0, f1, f2, f3, encode);
10662-
10663-			stbir__min_max_shift20(i0, f0);
10664-			stbir__min_max_shift20(i1, f1);
10665-			stbir__min_max_shift20(i2, f2);
10666-			stbir__scale_and_convert(i3, f3);
10667-
10668-			stbir__simdi_table_lookup3(i0, i1, i2,
10669-			                           (fp32_to_srgb8_tab4 - (127 - 13) * 8));
10670-
10671-			stbir__linear_to_srgb_finish(i0, f0);
10672-			stbir__linear_to_srgb_finish(i1, f1);
10673-			stbir__linear_to_srgb_finish(i2, f2);
10674-
10675-			stbir__interleave_pack_and_store_16_u8(
10676-			    output, STBIR_strs_join1(i, , stbir__encode_order0),
10677-			    STBIR_strs_join1(i, , stbir__encode_order1),
10678-			    STBIR_strs_join1(i, , stbir__encode_order2),
10679-			    STBIR_strs_join1(i, , stbir__encode_order3));
10680-
10681-			output += 16;
10682-			encode += 16;
10683-
10684-			if (output <= end_output) {
10685-				continue;
10686-			}
10687-			if (output == (end_output + 16)) {
10688-				break;
10689-			}
10690-			output = end_output; // backup and do last couple
10691-			encode = end_encode_m16;
10692-		}
10693-		return;
10694-	}
10695-#endif
10696-
10697-	STBIR_SIMD_NO_UNROLL_LOOP_START
10698-	do {
10699-		float f;
10700-		STBIR_SIMD_NO_UNROLL(encode);
10701-
10702-		output[stbir__decode_order0] = stbir__linear_to_srgb_uchar(encode[0]);
10703-		output[stbir__decode_order1] = stbir__linear_to_srgb_uchar(encode[1]);
10704-		output[stbir__decode_order2] = stbir__linear_to_srgb_uchar(encode[2]);
10705-
10706-		f = encode[3] * stbir__max_uint8_as_float + 0.5f;
10707-		STBIR_CLAMP(f, 0, 255);
10708-		output[stbir__decode_order3] = (unsigned char)f;
10709-
10710-		output += 4;
10711-		encode += 4;
10712-	} while (output < end_output);
10713-}
10714-
10715-#endif
10716-
10717-#if (stbir__coder_min_num == 2) ||                                             \
10718-    ((stbir__coder_min_num == 1) && (!defined(stbir__decode_swizzle)))
10719-
10720-static float *
10721-STBIR__CODER_NAME(stbir__decode_uint8_srgb2_linearalpha)(
10722-    float *decodep, int width_times_channels, void const *inputp)
10723-{
10724-	float STBIR_STREAMOUT_PTR(*) decode = decodep;
10725-	float *decode_end = (float *)decode + width_times_channels;
10726-	unsigned char const *input = (unsigned char const *)inputp;
10727-
10728-	decode += 4;
10729-	while (decode <= decode_end) {
10730-		decode[0 - 4] =
10731-		    stbir__srgb_uchar_to_linear_float[input[stbir__decode_order0]];
10732-		decode[1 - 4] = ((float)input[stbir__decode_order1]) *
10733-		                stbir__max_uint8_as_float_inverted;
10734-		decode[2 - 4] =
10735-		    stbir__srgb_uchar_to_linear_float[input[stbir__decode_order0 + 2]];
10736-		decode[3 - 4] = ((float)input[stbir__decode_order1 + 2]) *
10737-		                stbir__max_uint8_as_float_inverted;
10738-		input += 4;
10739-		decode += 4;
10740-	}
10741-	decode -= 4;
10742-	if (decode < decode_end) {
10743-		decode[0] = stbir__srgb_uchar_to_linear_float[input[stbir__decode_order0]];
10744-		decode[1] = ((float)input[stbir__decode_order1]) *
10745-		            stbir__max_uint8_as_float_inverted;
10746-	}
10747-	return decode_end;
10748-}
10749-
10750-static void
10751-STBIR__CODER_NAME(stbir__encode_uint8_srgb2_linearalpha)(
10752-    void *outputp, int width_times_channels, float const *encode)
10753-{
10754-	unsigned char STBIR_SIMD_STREAMOUT_PTR(*) output = (unsigned char *)outputp;
10755-	unsigned char *end_output =
10756-	    ((unsigned char *)output) + width_times_channels;
10757-
10758-#ifdef STBIR_SIMD
10759-
10760-	if (width_times_channels >= 16) {
10761-		float const *end_encode_m16 = encode + width_times_channels - 16;
10762-		end_output -= 16;
10763-		STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
10764-		for (;;) {
10765-			stbir__simdf f0, f1, f2, f3;
10766-			stbir__simdi i0, i1, i2, i3;
10767-
10768-			STBIR_SIMD_NO_UNROLL(encode);
10769-			stbir__simdf_load4_transposed(f0, f1, f2, f3, encode);
10770-
10771-			stbir__min_max_shift20(i0, f0);
10772-			stbir__scale_and_convert(i1, f1);
10773-			stbir__min_max_shift20(i2, f2);
10774-			stbir__scale_and_convert(i3, f3);
10775-
10776-			stbir__simdi_table_lookup2(i0, i2,
10777-			                           (fp32_to_srgb8_tab4 - (127 - 13) * 8));
10778-
10779-			stbir__linear_to_srgb_finish(i0, f0);
10780-			stbir__linear_to_srgb_finish(i2, f2);
10781-
10782-			stbir__interleave_pack_and_store_16_u8(
10783-			    output, STBIR_strs_join1(i, , stbir__encode_order0),
10784-			    STBIR_strs_join1(i, , stbir__encode_order1),
10785-			    STBIR_strs_join1(i, , stbir__encode_order2),
10786-			    STBIR_strs_join1(i, , stbir__encode_order3));
10787-
10788-			output += 16;
10789-			encode += 16;
10790-			if (output <= end_output) {
10791-				continue;
10792-			}
10793-			if (output == (end_output + 16)) {
10794-				break;
10795-			}
10796-			output = end_output; // backup and do last couple
10797-			encode = end_encode_m16;
10798-		}
10799-		return;
10800-	}
10801-#endif
10802-
10803-	STBIR_SIMD_NO_UNROLL_LOOP_START
10804-	do {
10805-		float f;
10806-		STBIR_SIMD_NO_UNROLL(encode);
10807-
10808-		output[stbir__decode_order0] = stbir__linear_to_srgb_uchar(encode[0]);
10809-
10810-		f = encode[1] * stbir__max_uint8_as_float + 0.5f;
10811-		STBIR_CLAMP(f, 0, 255);
10812-		output[stbir__decode_order1] = (unsigned char)f;
10813-
10814-		output += 2;
10815-		encode += 2;
10816-	} while (output < end_output);
10817-}
10818-
10819-#endif
10820-
10821-static float *
10822-STBIR__CODER_NAME(stbir__decode_uint16_linear_scaled)(float *decodep,
10823-                                                      int width_times_channels,
10824-                                                      void const *inputp)
10825-{
10826-	float STBIR_STREAMOUT_PTR(*) decode = decodep;
10827-	float *decode_end = (float *)decode + width_times_channels;
10828-	unsigned short const *input = (unsigned short const *)inputp;
10829-
10830-#ifdef STBIR_SIMD
10831-	unsigned short const *end_input_m8 = input + width_times_channels - 8;
10832-	if (width_times_channels >= 8) {
10833-		decode_end -= 8;
10834-		STBIR_NO_UNROLL_LOOP_START_INF_FOR
10835-		for (;;) {
10836-#ifdef STBIR_SIMD8
10837-			stbir__simdi i;
10838-			stbir__simdi8 o;
10839-			stbir__simdf8 of;
10840-			STBIR_NO_UNROLL(decode);
10841-			stbir__simdi_load(i, input);
10842-			stbir__simdi8_expand_u16_to_u32(o, i);
10843-			stbir__simdi8_convert_i32_to_float(of, o);
10844-			stbir__simdf8_mult(of, of, STBIR_max_uint16_as_float_inverted8);
10845-			stbir__decode_simdf8_flip(of);
10846-			stbir__simdf8_store(decode + 0, of);
10847-#else
10848-			stbir__simdi i, o0, o1;
10849-			stbir__simdf of0, of1;
10850-			STBIR_NO_UNROLL(decode);
10851-			stbir__simdi_load(i, input);
10852-			stbir__simdi_expand_u16_to_u32(o0, o1, i);
10853-			stbir__simdi_convert_i32_to_float(of0, o0);
10854-			stbir__simdi_convert_i32_to_float(of1, o1);
10855-			stbir__simdf_mult(
10856-			    of0, of0, STBIR__CONSTF(STBIR_max_uint16_as_float_inverted));
10857-			stbir__simdf_mult(
10858-			    of1, of1, STBIR__CONSTF(STBIR_max_uint16_as_float_inverted));
10859-			stbir__decode_simdf4_flip(of0);
10860-			stbir__decode_simdf4_flip(of1);
10861-			stbir__simdf_store(decode + 0, of0);
10862-			stbir__simdf_store(decode + 4, of1);
10863-#endif
10864-			decode += 8;
10865-			input += 8;
10866-			if (decode <= decode_end) {
10867-				continue;
10868-			}
10869-			if (decode == (decode_end + 8)) {
10870-				break;
10871-			}
10872-			decode = decode_end; // backup and do last couple
10873-			input = end_input_m8;
10874-		}
10875-		return decode_end + 8;
10876-	}
10877-#endif
10878-
10879-// try to do blocks of 4 when you can
10880-#if stbir__coder_min_num != 3 // doesn't divide cleanly by four
10881-	decode += 4;
10882-	STBIR_SIMD_NO_UNROLL_LOOP_START
10883-	while (decode <= decode_end) {
10884-		STBIR_SIMD_NO_UNROLL(decode);
10885-		decode[0 - 4] = ((float)(input[stbir__decode_order0])) *
10886-		                stbir__max_uint16_as_float_inverted;
10887-		decode[1 - 4] = ((float)(input[stbir__decode_order1])) *
10888-		                stbir__max_uint16_as_float_inverted;
10889-		decode[2 - 4] = ((float)(input[stbir__decode_order2])) *
10890-		                stbir__max_uint16_as_float_inverted;
10891-		decode[3 - 4] = ((float)(input[stbir__decode_order3])) *
10892-		                stbir__max_uint16_as_float_inverted;
10893-		decode += 4;
10894-		input += 4;
10895-	}
10896-	decode -= 4;
10897-#endif
10898-
10899-// do the remnants
10900-#if stbir__coder_min_num < 4
10901-	STBIR_NO_UNROLL_LOOP_START
10902-	while (decode < decode_end) {
10903-		STBIR_NO_UNROLL(decode);
10904-		decode[0] = ((float)(input[stbir__decode_order0])) *
10905-		            stbir__max_uint16_as_float_inverted;
10906-#if stbir__coder_min_num >= 2
10907-		decode[1] = ((float)(input[stbir__decode_order1])) *
10908-		            stbir__max_uint16_as_float_inverted;
10909-#endif
10910-#if stbir__coder_min_num >= 3
10911-		decode[2] = ((float)(input[stbir__decode_order2])) *
10912-		            stbir__max_uint16_as_float_inverted;
10913-#endif
10914-		decode += stbir__coder_min_num;
10915-		input += stbir__coder_min_num;
10916-	}
10917-#endif
10918-	return decode_end;
10919-}
10920-
10921-static void
10922-STBIR__CODER_NAME(stbir__encode_uint16_linear_scaled)(void *outputp,
10923-                                                      int width_times_channels,
10924-                                                      float const *encode)
10925-{
10926-	unsigned short STBIR_SIMD_STREAMOUT_PTR(*) output =
10927-	    (unsigned short *)outputp;
10928-	unsigned short *end_output =
10929-	    ((unsigned short *)output) + width_times_channels;
10930-
10931-#ifdef STBIR_SIMD
10932-	{
10933-		if (width_times_channels >= stbir__simdfX_float_count * 2) {
10934-			float const *end_encode_m8 =
10935-			    encode + width_times_channels - stbir__simdfX_float_count * 2;
10936-			end_output -= stbir__simdfX_float_count * 2;
10937-			STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
10938-			for (;;) {
10939-				stbir__simdfX e0, e1;
10940-				stbir__simdiX i;
10941-				STBIR_SIMD_NO_UNROLL(encode);
10942-				stbir__simdfX_madd_mem(e0, STBIR_simd_point5X,
10943-				                       STBIR_max_uint16_as_floatX, encode);
10944-				stbir__simdfX_madd_mem(e1, STBIR_simd_point5X,
10945-				                       STBIR_max_uint16_as_floatX,
10946-				                       encode + stbir__simdfX_float_count);
10947-				stbir__encode_simdfX_unflip(e0);
10948-				stbir__encode_simdfX_unflip(e1);
10949-				stbir__simdfX_pack_to_words(i, e0, e1);
10950-				stbir__simdiX_store(output, i);
10951-				encode += stbir__simdfX_float_count * 2;
10952-				output += stbir__simdfX_float_count * 2;
10953-				if (output <= end_output) {
10954-					continue;
10955-				}
10956-				if (output == (end_output + stbir__simdfX_float_count * 2)) {
10957-					break;
10958-				}
10959-				output = end_output; // backup and do last couple
10960-				encode = end_encode_m8;
10961-			}
10962-			return;
10963-		}
10964-	}
10965-
10966-// try to do blocks of 4 when you can
10967-#if stbir__coder_min_num != 3 // doesn't divide cleanly by four
10968-	output += 4;
10969-	STBIR_NO_UNROLL_LOOP_START
10970-	while (output <= end_output) {
10971-		stbir__simdf e;
10972-		stbir__simdi i;
10973-		STBIR_NO_UNROLL(encode);
10974-		stbir__simdf_load(e, encode);
10975-		stbir__simdf_madd(e, STBIR__CONSTF(STBIR_simd_point5),
10976-		                  STBIR__CONSTF(STBIR_max_uint16_as_float), e);
10977-		stbir__encode_simdf4_unflip(e);
10978-		stbir__simdf_pack_to_8words(i, e, e); // only use first 4
10979-		stbir__simdi_store2(output - 4, i);
10980-		output += 4;
10981-		encode += 4;
10982-	}
10983-	output -= 4;
10984-#endif
10985-
10986-// do the remnants
10987-#if stbir__coder_min_num < 4
10988-	STBIR_NO_UNROLL_LOOP_START
10989-	while (output < end_output) {
10990-		stbir__simdf e;
10991-		STBIR_NO_UNROLL(encode);
10992-		stbir__simdf_madd1_mem(e, STBIR__CONSTF(STBIR_simd_point5),
10993-		                       STBIR__CONSTF(STBIR_max_uint16_as_float),
10994-		                       encode + stbir__encode_order0);
10995-		output[0] = stbir__simdf_convert_float_to_short(e);
10996-#if stbir__coder_min_num >= 2
10997-		stbir__simdf_madd1_mem(e, STBIR__CONSTF(STBIR_simd_point5),
10998-		                       STBIR__CONSTF(STBIR_max_uint16_as_float),
10999-		                       encode + stbir__encode_order1);
11000-		output[1] = stbir__simdf_convert_float_to_short(e);
11001-#endif
11002-#if stbir__coder_min_num >= 3
11003-		stbir__simdf_madd1_mem(e, STBIR__CONSTF(STBIR_simd_point5),
11004-		                       STBIR__CONSTF(STBIR_max_uint16_as_float),
11005-		                       encode + stbir__encode_order2);
11006-		output[2] = stbir__simdf_convert_float_to_short(e);
11007-#endif
11008-		output += stbir__coder_min_num;
11009-		encode += stbir__coder_min_num;
11010-	}
11011-#endif
11012-
11013-#else
11014-
11015-// try to do blocks of 4 when you can
11016-#if stbir__coder_min_num != 3 // doesn't divide cleanly by four
11017-	output += 4;
11018-	STBIR_SIMD_NO_UNROLL_LOOP_START
11019-	while (output <= end_output) {
11020-		float f;
11021-		STBIR_SIMD_NO_UNROLL(encode);
11022-		f = encode[stbir__encode_order0] * stbir__max_uint16_as_float + 0.5f;
11023-		STBIR_CLAMP(f, 0, 65535);
11024-		output[0 - 4] = (unsigned short)f;
11025-		f = encode[stbir__encode_order1] * stbir__max_uint16_as_float + 0.5f;
11026-		STBIR_CLAMP(f, 0, 65535);
11027-		output[1 - 4] = (unsigned short)f;
11028-		f = encode[stbir__encode_order2] * stbir__max_uint16_as_float + 0.5f;
11029-		STBIR_CLAMP(f, 0, 65535);
11030-		output[2 - 4] = (unsigned short)f;
11031-		f = encode[stbir__encode_order3] * stbir__max_uint16_as_float + 0.5f;
11032-		STBIR_CLAMP(f, 0, 65535);
11033-		output[3 - 4] = (unsigned short)f;
11034-		output += 4;
11035-		encode += 4;
11036-	}
11037-	output -= 4;
11038-#endif
11039-
11040-// do the remnants
11041-#if stbir__coder_min_num < 4
11042-	STBIR_NO_UNROLL_LOOP_START
11043-	while (output < end_output) {
11044-		float f;
11045-		STBIR_NO_UNROLL(encode);
11046-		f = encode[stbir__encode_order0] * stbir__max_uint16_as_float + 0.5f;
11047-		STBIR_CLAMP(f, 0, 65535);
11048-		output[0] = (unsigned short)f;
11049-#if stbir__coder_min_num >= 2
11050-		f = encode[stbir__encode_order1] * stbir__max_uint16_as_float + 0.5f;
11051-		STBIR_CLAMP(f, 0, 65535);
11052-		output[1] = (unsigned short)f;
11053-#endif
11054-#if stbir__coder_min_num >= 3
11055-		f = encode[stbir__encode_order2] * stbir__max_uint16_as_float + 0.5f;
11056-		STBIR_CLAMP(f, 0, 65535);
11057-		output[2] = (unsigned short)f;
11058-#endif
11059-		output += stbir__coder_min_num;
11060-		encode += stbir__coder_min_num;
11061-	}
11062-#endif
11063-#endif
11064-}
11065-
11066-static float *
11067-STBIR__CODER_NAME(stbir__decode_uint16_linear)(float *decodep,
11068-                                               int width_times_channels,
11069-                                               void const *inputp)
11070-{
11071-	float STBIR_STREAMOUT_PTR(*) decode = decodep;
11072-	float *decode_end = (float *)decode + width_times_channels;
11073-	unsigned short const *input = (unsigned short const *)inputp;
11074-
11075-#ifdef STBIR_SIMD
11076-	unsigned short const *end_input_m8 = input + width_times_channels - 8;
11077-	if (width_times_channels >= 8) {
11078-		decode_end -= 8;
11079-		STBIR_NO_UNROLL_LOOP_START_INF_FOR
11080-		for (;;) {
11081-#ifdef STBIR_SIMD8
11082-			stbir__simdi i;
11083-			stbir__simdi8 o;
11084-			stbir__simdf8 of;
11085-			STBIR_NO_UNROLL(decode);
11086-			stbir__simdi_load(i, input);
11087-			stbir__simdi8_expand_u16_to_u32(o, i);
11088-			stbir__simdi8_convert_i32_to_float(of, o);
11089-			stbir__decode_simdf8_flip(of);
11090-			stbir__simdf8_store(decode + 0, of);
11091-#else
11092-			stbir__simdi i, o0, o1;
11093-			stbir__simdf of0, of1;
11094-			STBIR_NO_UNROLL(decode);
11095-			stbir__simdi_load(i, input);
11096-			stbir__simdi_expand_u16_to_u32(o0, o1, i);
11097-			stbir__simdi_convert_i32_to_float(of0, o0);
11098-			stbir__simdi_convert_i32_to_float(of1, o1);
11099-			stbir__decode_simdf4_flip(of0);
11100-			stbir__decode_simdf4_flip(of1);
11101-			stbir__simdf_store(decode + 0, of0);
11102-			stbir__simdf_store(decode + 4, of1);
11103-#endif
11104-			decode += 8;
11105-			input += 8;
11106-			if (decode <= decode_end) {
11107-				continue;
11108-			}
11109-			if (decode == (decode_end + 8)) {
11110-				break;
11111-			}
11112-			decode = decode_end; // backup and do last couple
11113-			input = end_input_m8;
11114-		}
11115-		return decode_end + 8;
11116-	}
11117-#endif
11118-
11119-// try to do blocks of 4 when you can
11120-#if stbir__coder_min_num != 3 // doesn't divide cleanly by four
11121-	decode += 4;
11122-	STBIR_SIMD_NO_UNROLL_LOOP_START
11123-	while (decode <= decode_end) {
11124-		STBIR_SIMD_NO_UNROLL(decode);
11125-		decode[0 - 4] = ((float)(input[stbir__decode_order0]));
11126-		decode[1 - 4] = ((float)(input[stbir__decode_order1]));
11127-		decode[2 - 4] = ((float)(input[stbir__decode_order2]));
11128-		decode[3 - 4] = ((float)(input[stbir__decode_order3]));
11129-		decode += 4;
11130-		input += 4;
11131-	}
11132-	decode -= 4;
11133-#endif
11134-
11135-// do the remnants
11136-#if stbir__coder_min_num < 4
11137-	STBIR_NO_UNROLL_LOOP_START
11138-	while (decode < decode_end) {
11139-		STBIR_NO_UNROLL(decode);
11140-		decode[0] = ((float)(input[stbir__decode_order0]));
11141-#if stbir__coder_min_num >= 2
11142-		decode[1] = ((float)(input[stbir__decode_order1]));
11143-#endif
11144-#if stbir__coder_min_num >= 3
11145-		decode[2] = ((float)(input[stbir__decode_order2]));
11146-#endif
11147-		decode += stbir__coder_min_num;
11148-		input += stbir__coder_min_num;
11149-	}
11150-#endif
11151-	return decode_end;
11152-}
11153-
11154-static void
11155-STBIR__CODER_NAME(stbir__encode_uint16_linear)(void *outputp,
11156-                                               int width_times_channels,
11157-                                               float const *encode)
11158-{
11159-	unsigned short STBIR_SIMD_STREAMOUT_PTR(*) output =
11160-	    (unsigned short *)outputp;
11161-	unsigned short *end_output =
11162-	    ((unsigned short *)output) + width_times_channels;
11163-
11164-#ifdef STBIR_SIMD
11165-	{
11166-		if (width_times_channels >= stbir__simdfX_float_count * 2) {
11167-			float const *end_encode_m8 =
11168-			    encode + width_times_channels - stbir__simdfX_float_count * 2;
11169-			end_output -= stbir__simdfX_float_count * 2;
11170-			STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
11171-			for (;;) {
11172-				stbir__simdfX e0, e1;
11173-				stbir__simdiX i;
11174-				STBIR_SIMD_NO_UNROLL(encode);
11175-				stbir__simdfX_add_mem(e0, STBIR_simd_point5X, encode);
11176-				stbir__simdfX_add_mem(e1, STBIR_simd_point5X,
11177-				                      encode + stbir__simdfX_float_count);
11178-				stbir__encode_simdfX_unflip(e0);
11179-				stbir__encode_simdfX_unflip(e1);
11180-				stbir__simdfX_pack_to_words(i, e0, e1);
11181-				stbir__simdiX_store(output, i);
11182-				encode += stbir__simdfX_float_count * 2;
11183-				output += stbir__simdfX_float_count * 2;
11184-				if (output <= end_output) {
11185-					continue;
11186-				}
11187-				if (output == (end_output + stbir__simdfX_float_count * 2)) {
11188-					break;
11189-				}
11190-				output = end_output; // backup and do last couple
11191-				encode = end_encode_m8;
11192-			}
11193-			return;
11194-		}
11195-	}
11196-
11197-// try to do blocks of 4 when you can
11198-#if stbir__coder_min_num != 3 // doesn't divide cleanly by four
11199-	output += 4;
11200-	STBIR_NO_UNROLL_LOOP_START
11201-	while (output <= end_output) {
11202-		stbir__simdf e;
11203-		stbir__simdi i;
11204-		STBIR_NO_UNROLL(encode);
11205-		stbir__simdf_load(e, encode);
11206-		stbir__simdf_add(e, STBIR__CONSTF(STBIR_simd_point5), e);
11207-		stbir__encode_simdf4_unflip(e);
11208-		stbir__simdf_pack_to_8words(i, e, e); // only use first 4
11209-		stbir__simdi_store2(output - 4, i);
11210-		output += 4;
11211-		encode += 4;
11212-	}
11213-	output -= 4;
11214-#endif
11215-
11216-#else
11217-
11218-// try to do blocks of 4 when you can
11219-#if stbir__coder_min_num != 3 // doesn't divide cleanly by four
11220-	output += 4;
11221-	STBIR_SIMD_NO_UNROLL_LOOP_START
11222-	while (output <= end_output) {
11223-		float f;
11224-		STBIR_SIMD_NO_UNROLL(encode);
11225-		f = encode[stbir__encode_order0] + 0.5f;
11226-		STBIR_CLAMP(f, 0, 65535);
11227-		output[0 - 4] = (unsigned short)f;
11228-		f = encode[stbir__encode_order1] + 0.5f;
11229-		STBIR_CLAMP(f, 0, 65535);
11230-		output[1 - 4] = (unsigned short)f;
11231-		f = encode[stbir__encode_order2] + 0.5f;
11232-		STBIR_CLAMP(f, 0, 65535);
11233-		output[2 - 4] = (unsigned short)f;
11234-		f = encode[stbir__encode_order3] + 0.5f;
11235-		STBIR_CLAMP(f, 0, 65535);
11236-		output[3 - 4] = (unsigned short)f;
11237-		output += 4;
11238-		encode += 4;
11239-	}
11240-	output -= 4;
11241-#endif
11242-
11243-#endif
11244-
11245-// do the remnants
11246-#if stbir__coder_min_num < 4
11247-	STBIR_NO_UNROLL_LOOP_START
11248-	while (output < end_output) {
11249-		float f;
11250-		STBIR_NO_UNROLL(encode);
11251-		f = encode[stbir__encode_order0] + 0.5f;
11252-		STBIR_CLAMP(f, 0, 65535);
11253-		output[0] = (unsigned short)f;
11254-#if stbir__coder_min_num >= 2
11255-		f = encode[stbir__encode_order1] + 0.5f;
11256-		STBIR_CLAMP(f, 0, 65535);
11257-		output[1] = (unsigned short)f;
11258-#endif
11259-#if stbir__coder_min_num >= 3
11260-		f = encode[stbir__encode_order2] + 0.5f;
11261-		STBIR_CLAMP(f, 0, 65535);
11262-		output[2] = (unsigned short)f;
11263-#endif
11264-		output += stbir__coder_min_num;
11265-		encode += stbir__coder_min_num;
11266-	}
11267-#endif
11268-}
11269-
11270-static float *
11271-STBIR__CODER_NAME(stbir__decode_half_float_linear)(float *decodep,
11272-                                                   int width_times_channels,
11273-                                                   void const *inputp)
11274-{
11275-	float STBIR_STREAMOUT_PTR(*) decode = decodep;
11276-	float *decode_end = (float *)decode + width_times_channels;
11277-	stbir__FP16 const *input = (stbir__FP16 const *)inputp;
11278-
11279-#ifdef STBIR_SIMD
11280-	if (width_times_channels >= 8) {
11281-		stbir__FP16 const *end_input_m8 = input + width_times_channels - 8;
11282-		decode_end -= 8;
11283-		STBIR_NO_UNROLL_LOOP_START_INF_FOR
11284-		for (;;) {
11285-			STBIR_NO_UNROLL(decode);
11286-
11287-			stbir__half_to_float_SIMD(decode, input);
11288-#ifdef stbir__decode_swizzle
11289-#ifdef STBIR_SIMD8
11290-			{
11291-				stbir__simdf8 of;
11292-				stbir__simdf8_load(of, decode);
11293-				stbir__decode_simdf8_flip(of);
11294-				stbir__simdf8_store(decode, of);
11295-			}
11296-#else
11297-			{
11298-				stbir__simdf of0, of1;
11299-				stbir__simdf_load(of0, decode);
11300-				stbir__simdf_load(of1, decode + 4);
11301-				stbir__decode_simdf4_flip(of0);
11302-				stbir__decode_simdf4_flip(of1);
11303-				stbir__simdf_store(decode, of0);
11304-				stbir__simdf_store(decode + 4, of1);
11305-			}
11306-#endif
11307-#endif
11308-			decode += 8;
11309-			input += 8;
11310-			if (decode <= decode_end) {
11311-				continue;
11312-			}
11313-			if (decode == (decode_end + 8)) {
11314-				break;
11315-			}
11316-			decode = decode_end; // backup and do last couple
11317-			input = end_input_m8;
11318-		}
11319-		return decode_end + 8;
11320-	}
11321-#endif
11322-
11323-// try to do blocks of 4 when you can
11324-#if stbir__coder_min_num != 3 // doesn't divide cleanly by four
11325-	decode += 4;
11326-	STBIR_SIMD_NO_UNROLL_LOOP_START
11327-	while (decode <= decode_end) {
11328-		STBIR_SIMD_NO_UNROLL(decode);
11329-		decode[0 - 4] = stbir__half_to_float(input[stbir__decode_order0]);
11330-		decode[1 - 4] = stbir__half_to_float(input[stbir__decode_order1]);
11331-		decode[2 - 4] = stbir__half_to_float(input[stbir__decode_order2]);
11332-		decode[3 - 4] = stbir__half_to_float(input[stbir__decode_order3]);
11333-		decode += 4;
11334-		input += 4;
11335-	}
11336-	decode -= 4;
11337-#endif
11338-
11339-// do the remnants
11340-#if stbir__coder_min_num < 4
11341-	STBIR_NO_UNROLL_LOOP_START
11342-	while (decode < decode_end) {
11343-		STBIR_NO_UNROLL(decode);
11344-		decode[0] = stbir__half_to_float(input[stbir__decode_order0]);
11345-#if stbir__coder_min_num >= 2
11346-		decode[1] = stbir__half_to_float(input[stbir__decode_order1]);
11347-#endif
11348-#if stbir__coder_min_num >= 3
11349-		decode[2] = stbir__half_to_float(input[stbir__decode_order2]);
11350-#endif
11351-		decode += stbir__coder_min_num;
11352-		input += stbir__coder_min_num;
11353-	}
11354-#endif
11355-	return decode_end;
11356-}
11357-
11358-static void
11359-STBIR__CODER_NAME(stbir__encode_half_float_linear)(void *outputp,
11360-                                                   int width_times_channels,
11361-                                                   float const *encode)
11362-{
11363-	stbir__FP16 STBIR_SIMD_STREAMOUT_PTR(*) output = (stbir__FP16 *)outputp;
11364-	stbir__FP16 *end_output = ((stbir__FP16 *)output) + width_times_channels;
11365-
11366-#ifdef STBIR_SIMD
11367-	if (width_times_channels >= 8) {
11368-		float const *end_encode_m8 = encode + width_times_channels - 8;
11369-		end_output -= 8;
11370-		STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
11371-		for (;;) {
11372-			STBIR_SIMD_NO_UNROLL(encode);
11373-#ifdef stbir__decode_swizzle
11374-#ifdef STBIR_SIMD8
11375-			{
11376-				stbir__simdf8 of;
11377-				stbir__simdf8_load(of, encode);
11378-				stbir__encode_simdf8_unflip(of);
11379-				stbir__float_to_half_SIMD(output, (float *)&of);
11380-			}
11381-#else
11382-			{
11383-				stbir__simdf of[2];
11384-				stbir__simdf_load(of[0], encode);
11385-				stbir__simdf_load(of[1], encode + 4);
11386-				stbir__encode_simdf4_unflip(of[0]);
11387-				stbir__encode_simdf4_unflip(of[1]);
11388-				stbir__float_to_half_SIMD(output, (float *)of);
11389-			}
11390-#endif
11391-#else
11392-			stbir__float_to_half_SIMD(output, encode);
11393-#endif
11394-			encode += 8;
11395-			output += 8;
11396-			if (output <= end_output) {
11397-				continue;
11398-			}
11399-			if (output == (end_output + 8)) {
11400-				break;
11401-			}
11402-			output = end_output; // backup and do last couple
11403-			encode = end_encode_m8;
11404-		}
11405-		return;
11406-	}
11407-#endif
11408-
11409-// try to do blocks of 4 when you can
11410-#if stbir__coder_min_num != 3 // doesn't divide cleanly by four
11411-	output += 4;
11412-	STBIR_SIMD_NO_UNROLL_LOOP_START
11413-	while (output <= end_output) {
11414-		STBIR_SIMD_NO_UNROLL(output);
11415-		output[0 - 4] = stbir__float_to_half(encode[stbir__encode_order0]);
11416-		output[1 - 4] = stbir__float_to_half(encode[stbir__encode_order1]);
11417-		output[2 - 4] = stbir__float_to_half(encode[stbir__encode_order2]);
11418-		output[3 - 4] = stbir__float_to_half(encode[stbir__encode_order3]);
11419-		output += 4;
11420-		encode += 4;
11421-	}
11422-	output -= 4;
11423-#endif
11424-
11425-// do the remnants
11426-#if stbir__coder_min_num < 4
11427-	STBIR_NO_UNROLL_LOOP_START
11428-	while (output < end_output) {
11429-		STBIR_NO_UNROLL(output);
11430-		output[0] = stbir__float_to_half(encode[stbir__encode_order0]);
11431-#if stbir__coder_min_num >= 2
11432-		output[1] = stbir__float_to_half(encode[stbir__encode_order1]);
11433-#endif
11434-#if stbir__coder_min_num >= 3
11435-		output[2] = stbir__float_to_half(encode[stbir__encode_order2]);
11436-#endif
11437-		output += stbir__coder_min_num;
11438-		encode += stbir__coder_min_num;
11439-	}
11440-#endif
11441-}
11442-
11443-static float *
11444-STBIR__CODER_NAME(stbir__decode_float_linear)(float *decodep,
11445-                                              int width_times_channels,
11446-                                              void const *inputp)
11447-{
11448-#ifdef stbir__decode_swizzle
11449-	float STBIR_STREAMOUT_PTR(*) decode = decodep;
11450-	float *decode_end = (float *)decode + width_times_channels;
11451-	float const *input = (float const *)inputp;
11452-
11453-#ifdef STBIR_SIMD
11454-	if (width_times_channels >= 16) {
11455-		float const *end_input_m16 = input + width_times_channels - 16;
11456-		decode_end -= 16;
11457-		STBIR_NO_UNROLL_LOOP_START_INF_FOR
11458-		for (;;) {
11459-			STBIR_NO_UNROLL(decode);
11460-#ifdef stbir__decode_swizzle
11461-#ifdef STBIR_SIMD8
11462-			{
11463-				stbir__simdf8 of0, of1;
11464-				stbir__simdf8_load(of0, input);
11465-				stbir__simdf8_load(of1, input + 8);
11466-				stbir__decode_simdf8_flip(of0);
11467-				stbir__decode_simdf8_flip(of1);
11468-				stbir__simdf8_store(decode, of0);
11469-				stbir__simdf8_store(decode + 8, of1);
11470-			}
11471-#else
11472-			{
11473-				stbir__simdf of0, of1, of2, of3;
11474-				stbir__simdf_load(of0, input);
11475-				stbir__simdf_load(of1, input + 4);
11476-				stbir__simdf_load(of2, input + 8);
11477-				stbir__simdf_load(of3, input + 12);
11478-				stbir__decode_simdf4_flip(of0);
11479-				stbir__decode_simdf4_flip(of1);
11480-				stbir__decode_simdf4_flip(of2);
11481-				stbir__decode_simdf4_flip(of3);
11482-				stbir__simdf_store(decode, of0);
11483-				stbir__simdf_store(decode + 4, of1);
11484-				stbir__simdf_store(decode + 8, of2);
11485-				stbir__simdf_store(decode + 12, of3);
11486-			}
11487-#endif
11488-#endif
11489-			decode += 16;
11490-			input += 16;
11491-			if (decode <= decode_end) {
11492-				continue;
11493-			}
11494-			if (decode == (decode_end + 16)) {
11495-				break;
11496-			}
11497-			decode = decode_end; // backup and do last couple
11498-			input = end_input_m16;
11499-		}
11500-		return decode_end + 16;
11501-	}
11502-#endif
11503-
11504-// try to do blocks of 4 when you can
11505-#if stbir__coder_min_num != 3 // doesn't divide cleanly by four
11506-	decode += 4;
11507-	STBIR_SIMD_NO_UNROLL_LOOP_START
11508-	while (decode <= decode_end) {
11509-		STBIR_SIMD_NO_UNROLL(decode);
11510-		decode[0 - 4] = input[stbir__decode_order0];
11511-		decode[1 - 4] = input[stbir__decode_order1];
11512-		decode[2 - 4] = input[stbir__decode_order2];
11513-		decode[3 - 4] = input[stbir__decode_order3];
11514-		decode += 4;
11515-		input += 4;
11516-	}
11517-	decode -= 4;
11518-#endif
11519-
11520-// do the remnants
11521-#if stbir__coder_min_num < 4
11522-	STBIR_NO_UNROLL_LOOP_START
11523-	while (decode < decode_end) {
11524-		STBIR_NO_UNROLL(decode);
11525-		decode[0] = input[stbir__decode_order0];
11526-#if stbir__coder_min_num >= 2
11527-		decode[1] = input[stbir__decode_order1];
11528-#endif
11529-#if stbir__coder_min_num >= 3
11530-		decode[2] = input[stbir__decode_order2];
11531-#endif
11532-		decode += stbir__coder_min_num;
11533-		input += stbir__coder_min_num;
11534-	}
11535-#endif
11536-	return decode_end;
11537-
11538-#else
11539-
11540-	if ((void *)decodep != inputp) {
11541-		STBIR_MEMCPY(decodep, inputp, width_times_channels * sizeof(float));
11542-	}
11543-
11544-	return decodep + width_times_channels;
11545-
11546-#endif
11547-}
11548-
11549-static void
11550-STBIR__CODER_NAME(stbir__encode_float_linear)(void *outputp,
11551-                                              int width_times_channels,
11552-                                              float const *encode)
11553-{
11554-#if !defined(STBIR_FLOAT_HIGH_CLAMP) && !defined(STBIR_FLOAT_LOW_CLAMP) &&     \
11555-    !defined(stbir__decode_swizzle)
11556-
11557-	if ((void *)outputp != (void *)encode) {
11558-		STBIR_MEMCPY(outputp, encode, width_times_channels * sizeof(float));
11559-	}
11560-
11561-#else
11562-
11563-	float STBIR_SIMD_STREAMOUT_PTR(*) output = (float *)outputp;
11564-	float *end_output = ((float *)output) + width_times_channels;
11565-
11566-#ifdef STBIR_FLOAT_HIGH_CLAMP
11567-#define stbir_scalar_hi_clamp(v)                                               \
11568-	if (v > STBIR_FLOAT_HIGH_CLAMP)                                            \
11569-		v = STBIR_FLOAT_HIGH_CLAMP;
11570-#else
11571-#define stbir_scalar_hi_clamp(v)
11572-#endif
11573-#ifdef STBIR_FLOAT_LOW_CLAMP
11574-#define stbir_scalar_lo_clamp(v)                                               \
11575-	if (v < STBIR_FLOAT_LOW_CLAMP)                                             \
11576-		v = STBIR_FLOAT_LOW_CLAMP;
11577-#else
11578-#define stbir_scalar_lo_clamp(v)
11579-#endif
11580-
11581-#ifdef STBIR_SIMD
11582-
11583-#ifdef STBIR_FLOAT_HIGH_CLAMP
11584-	const stbir__simdfX high_clamp = stbir__simdf_frepX(STBIR_FLOAT_HIGH_CLAMP);
11585-#endif
11586-#ifdef STBIR_FLOAT_LOW_CLAMP
11587-	const stbir__simdfX low_clamp = stbir__simdf_frepX(STBIR_FLOAT_LOW_CLAMP);
11588-#endif
11589-
11590-	if (width_times_channels >= (stbir__simdfX_float_count * 2)) {
11591-		float const *end_encode_m8 =
11592-		    encode + width_times_channels - (stbir__simdfX_float_count * 2);
11593-		end_output -= (stbir__simdfX_float_count * 2);
11594-		STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
11595-		for (;;) {
11596-			stbir__simdfX e0, e1;
11597-			STBIR_SIMD_NO_UNROLL(encode);
11598-			stbir__simdfX_load(e0, encode);
11599-			stbir__simdfX_load(e1, encode + stbir__simdfX_float_count);
11600-#ifdef STBIR_FLOAT_HIGH_CLAMP
11601-			stbir__simdfX_min(e0, e0, high_clamp);
11602-			stbir__simdfX_min(e1, e1, high_clamp);
11603-#endif
11604-#ifdef STBIR_FLOAT_LOW_CLAMP
11605-			stbir__simdfX_max(e0, e0, low_clamp);
11606-			stbir__simdfX_max(e1, e1, low_clamp);
11607-#endif
11608-			stbir__encode_simdfX_unflip(e0);
11609-			stbir__encode_simdfX_unflip(e1);
11610-			stbir__simdfX_store(output, e0);
11611-			stbir__simdfX_store(output + stbir__simdfX_float_count, e1);
11612-			encode += stbir__simdfX_float_count * 2;
11613-			output += stbir__simdfX_float_count * 2;
11614-			if (output < end_output) {
11615-				continue;
11616-			}
11617-			if (output == (end_output + (stbir__simdfX_float_count * 2))) {
11618-				break;
11619-			}
11620-			output = end_output; // backup and do last couple
11621-			encode = end_encode_m8;
11622-		}
11623-		return;
11624-	}
11625-
11626-// try to do blocks of 4 when you can
11627-#if stbir__coder_min_num != 3 // doesn't divide cleanly by four
11628-	output += 4;
11629-	STBIR_NO_UNROLL_LOOP_START
11630-	while (output <= end_output) {
11631-		stbir__simdf e0;
11632-		STBIR_NO_UNROLL(encode);
11633-		stbir__simdf_load(e0, encode);
11634-#ifdef STBIR_FLOAT_HIGH_CLAMP
11635-		stbir__simdf_min(e0, e0, high_clamp);
11636-#endif
11637-#ifdef STBIR_FLOAT_LOW_CLAMP
11638-		stbir__simdf_max(e0, e0, low_clamp);
11639-#endif
11640-		stbir__encode_simdf4_unflip(e0);
11641-		stbir__simdf_store(output - 4, e0);
11642-		output += 4;
11643-		encode += 4;
11644-	}
11645-	output -= 4;
11646-#endif
11647-
11648-#else
11649-
11650-// try to do blocks of 4 when you can
11651-#if stbir__coder_min_num != 3 // doesn't divide cleanly by four
11652-	output += 4;
11653-	STBIR_SIMD_NO_UNROLL_LOOP_START
11654-	while (output <= end_output) {
11655-		float e;
11656-		STBIR_SIMD_NO_UNROLL(encode);
11657-		e = encode[stbir__encode_order0];
11658-		stbir_scalar_hi_clamp(e);
11659-		stbir_scalar_lo_clamp(e);
11660-		output[0 - 4] = e;
11661-		e = encode[stbir__encode_order1];
11662-		stbir_scalar_hi_clamp(e);
11663-		stbir_scalar_lo_clamp(e);
11664-		output[1 - 4] = e;
11665-		e = encode[stbir__encode_order2];
11666-		stbir_scalar_hi_clamp(e);
11667-		stbir_scalar_lo_clamp(e);
11668-		output[2 - 4] = e;
11669-		e = encode[stbir__encode_order3];
11670-		stbir_scalar_hi_clamp(e);
11671-		stbir_scalar_lo_clamp(e);
11672-		output[3 - 4] = e;
11673-		output += 4;
11674-		encode += 4;
11675-	}
11676-	output -= 4;
11677-
11678-#endif
11679-
11680-#endif
11681-
11682-// do the remnants
11683-#if stbir__coder_min_num < 4
11684-	STBIR_NO_UNROLL_LOOP_START
11685-	while (output < end_output) {
11686-		float e;
11687-		STBIR_NO_UNROLL(encode);
11688-		e = encode[stbir__encode_order0];
11689-		stbir_scalar_hi_clamp(e);
11690-		stbir_scalar_lo_clamp(e);
11691-		output[0] = e;
11692-#if stbir__coder_min_num >= 2
11693-		e = encode[stbir__encode_order1];
11694-		stbir_scalar_hi_clamp(e);
11695-		stbir_scalar_lo_clamp(e);
11696-		output[1] = e;
11697-#endif
11698-#if stbir__coder_min_num >= 3
11699-		e = encode[stbir__encode_order2];
11700-		stbir_scalar_hi_clamp(e);
11701-		stbir_scalar_lo_clamp(e);
11702-		output[2] = e;
11703-#endif
11704-		output += stbir__coder_min_num;
11705-		encode += stbir__coder_min_num;
11706-	}
11707-#endif
11708-
11709-#endif
11710-}
11711-
11712-#undef stbir__decode_suffix
11713-#undef stbir__decode_simdf8_flip
11714-#undef stbir__decode_simdf4_flip
11715-#undef stbir__decode_order0
11716-#undef stbir__decode_order1
11717-#undef stbir__decode_order2
11718-#undef stbir__decode_order3
11719-#undef stbir__encode_order0
11720-#undef stbir__encode_order1
11721-#undef stbir__encode_order2
11722-#undef stbir__encode_order3
11723-#undef stbir__encode_simdf8_unflip
11724-#undef stbir__encode_simdf4_unflip
11725-#undef stbir__encode_simdfX_unflip
11726-#undef STBIR__CODER_NAME
11727-#undef stbir__coder_min_num
11728-#undef stbir__decode_swizzle
11729-#undef stbir_scalar_hi_clamp
11730-#undef stbir_scalar_lo_clamp
11731-#undef STB_IMAGE_RESIZE_DO_CODERS
11732-
11733-#elif defined(STB_IMAGE_RESIZE_DO_VERTICALS)
11734-
11735-#ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
11736-#define STBIR_chans(start, end)                                                \
11737-	STBIR_strs_join14(start, STBIR__vertical_channels, end, _cont)
11738-#else
11739-#define STBIR_chans(start, end)                                                \
11740-	STBIR_strs_join1(start, STBIR__vertical_channels, end)
11741-#endif
11742-
11743-#if STBIR__vertical_channels >= 1
11744-#define stbIF0(code) code
11745-#else
11746-#define stbIF0(code)
11747-#endif
11748-#if STBIR__vertical_channels >= 2
11749-#define stbIF1(code) code
11750-#else
11751-#define stbIF1(code)
11752-#endif
11753-#if STBIR__vertical_channels >= 3
11754-#define stbIF2(code) code
11755-#else
11756-#define stbIF2(code)
11757-#endif
11758-#if STBIR__vertical_channels >= 4
11759-#define stbIF3(code) code
11760-#else
11761-#define stbIF3(code)
11762-#endif
11763-#if STBIR__vertical_channels >= 5
11764-#define stbIF4(code) code
11765-#else
11766-#define stbIF4(code)
11767-#endif
11768-#if STBIR__vertical_channels >= 6
11769-#define stbIF5(code) code
11770-#else
11771-#define stbIF5(code)
11772-#endif
11773-#if STBIR__vertical_channels >= 7
11774-#define stbIF6(code) code
11775-#else
11776-#define stbIF6(code)
11777-#endif
11778-#if STBIR__vertical_channels >= 8
11779-#define stbIF7(code) code
11780-#else
11781-#define stbIF7(code)
11782-#endif
11783-
11784-static void
11785-STBIR_chans(stbir__vertical_scatter_with_,
11786-            _coeffs)(float **outputs,
11787-                     float const *vertical_coefficients,
11788-                     float const *input,
11789-                     float const *input_end)
11790-{
11791-	stbIF0(float STBIR_SIMD_STREAMOUT_PTR(*) output0 = outputs[0];
11792-	       float c0s = vertical_coefficients[0];)
11793-	    stbIF1(float STBIR_SIMD_STREAMOUT_PTR(*) output1 = outputs[1];
11794-	           float c1s = vertical_coefficients[1];)
11795-	        stbIF2(float STBIR_SIMD_STREAMOUT_PTR(*) output2 = outputs[2];
11796-	               float c2s = vertical_coefficients[2];)
11797-	            stbIF3(float STBIR_SIMD_STREAMOUT_PTR(*) output3 = outputs[3];
11798-	                   float c3s = vertical_coefficients[3];)
11799-	                stbIF4(float STBIR_SIMD_STREAMOUT_PTR(*) output4 =
11800-	                           outputs[4];
11801-	                       float c4s = vertical_coefficients[4];)
11802-	                    stbIF5(float STBIR_SIMD_STREAMOUT_PTR(*) output5 =
11803-	                               outputs[5];
11804-	                           float c5s = vertical_coefficients[5];)
11805-	                        stbIF6(float STBIR_SIMD_STREAMOUT_PTR(*) output6 =
11806-	                                   outputs[6];
11807-	                               float c6s = vertical_coefficients[6];)
11808-	                            stbIF7(float STBIR_SIMD_STREAMOUT_PTR(*)
11809-	                                       output7 = outputs[7];
11810-	                                   float c7s = vertical_coefficients[7];)
11811-
11812-#ifdef STBIR_SIMD
11813-	{
11814-		stbIF0(stbir__simdfX c0 = stbir__simdf_frepX(c0s);)
11815-		    stbIF1(stbir__simdfX c1 = stbir__simdf_frepX(c1s);)
11816-		        stbIF2(stbir__simdfX c2 = stbir__simdf_frepX(c2s);) stbIF3(
11817-		            stbir__simdfX c3 = stbir__simdf_frepX(c3s);)
11818-		            stbIF4(stbir__simdfX c4 = stbir__simdf_frepX(c4s);) stbIF5(
11819-		                stbir__simdfX c5 = stbir__simdf_frepX(c5s);)
11820-		                stbIF6(stbir__simdfX c6 = stbir__simdf_frepX(c6s);)
11821-		                    stbIF7(stbir__simdfX c7 = stbir__simdf_frepX(c7s);)
11822-		                        STBIR_SIMD_NO_UNROLL_LOOP_START while (
11823-		                            ((char *)input_end - (char *)input) >=
11824-		                            (16 * stbir__simdfX_float_count))
11825-		{
11826-			stbir__simdfX o0, o1, o2, o3, r0, r1, r2, r3;
11827-			STBIR_SIMD_NO_UNROLL(output0);
11828-
11829-			stbir__simdfX_load(r0, input);
11830-			stbir__simdfX_load(r1, input + stbir__simdfX_float_count);
11831-			stbir__simdfX_load(r2, input + (2 * stbir__simdfX_float_count));
11832-			stbir__simdfX_load(r3, input + (3 * stbir__simdfX_float_count));
11833-
11834-#ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
11835-			stbIF0(
11836-			    stbir__simdfX_load(o0, output0);
11837-			    stbir__simdfX_load(o1, output0 + stbir__simdfX_float_count);
11838-			    stbir__simdfX_load(o2,
11839-			                       output0 + (2 * stbir__simdfX_float_count));
11840-			    stbir__simdfX_load(o3,
11841-			                       output0 + (3 * stbir__simdfX_float_count));
11842-			    stbir__simdfX_madd(o0, o0, r0, c0);
11843-			    stbir__simdfX_madd(o1, o1, r1, c0);
11844-			    stbir__simdfX_madd(o2, o2, r2, c0);
11845-			    stbir__simdfX_madd(o3, o3, r3, c0);
11846-			    stbir__simdfX_store(output0, o0);
11847-			    stbir__simdfX_store(output0 + stbir__simdfX_float_count, o1);
11848-			    stbir__simdfX_store(output0 + (2 * stbir__simdfX_float_count),
11849-			                        o2);
11850-			    stbir__simdfX_store(
11851-			        output0 + (3 * stbir__simdfX_float_count),
11852-			        o3);) stbIF1(stbir__simdfX_load(o0, output1);
11853-			                     stbir__simdfX_load(
11854-			                         o1, output1 + stbir__simdfX_float_count);
11855-			                     stbir__simdfX_load(
11856-			                         o2,
11857-			                         output1 + (2 * stbir__simdfX_float_count));
11858-			                     stbir__simdfX_load(
11859-			                         o3,
11860-			                         output1 + (3 * stbir__simdfX_float_count));
11861-			                     stbir__simdfX_madd(o0, o0, r0, c1);
11862-			                     stbir__simdfX_madd(o1, o1, r1, c1);
11863-			                     stbir__simdfX_madd(o2, o2, r2, c1);
11864-			                     stbir__simdfX_madd(o3, o3, r3, c1);
11865-			                     stbir__simdfX_store(output1, o0);
11866-			                     stbir__simdfX_store(
11867-			                         output1 + stbir__simdfX_float_count, o1);
11868-			                     stbir__simdfX_store(
11869-			                         output1 + (2 * stbir__simdfX_float_count),
11870-			                         o2);
11871-			                     stbir__simdfX_store(
11872-			                         output1 + (3 * stbir__simdfX_float_count),
11873-			                         o3);)
11874-			    stbIF2(
11875-			        stbir__simdfX_load(o0, output2);
11876-			        stbir__simdfX_load(o1, output2 + stbir__simdfX_float_count);
11877-			        stbir__simdfX_load(
11878-			            o2, output2 + (2 * stbir__simdfX_float_count));
11879-			        stbir__simdfX_load(
11880-			            o3, output2 + (3 * stbir__simdfX_float_count));
11881-			        stbir__simdfX_madd(o0, o0, r0, c2);
11882-			        stbir__simdfX_madd(o1, o1, r1, c2);
11883-			        stbir__simdfX_madd(o2, o2, r2, c2);
11884-			        stbir__simdfX_madd(o3, o3, r3, c2);
11885-			        stbir__simdfX_store(output2, o0);
11886-			        stbir__simdfX_store(output2 + stbir__simdfX_float_count,
11887-			                            o1);
11888-			        stbir__simdfX_store(
11889-			            output2 + (2 * stbir__simdfX_float_count), o2);
11890-			        stbir__simdfX_store(
11891-			            output2 + (3 * stbir__simdfX_float_count),
11892-			            o3);) stbIF3(stbir__simdfX_load(o0, output3);
11893-			                         stbir__simdfX_load(
11894-			                             o1,
11895-			                             output3 + stbir__simdfX_float_count);
11896-			                         stbir__simdfX_load(
11897-			                             o2,
11898-			                             output3 +
11899-			                                 (2 * stbir__simdfX_float_count));
11900-			                         stbir__simdfX_load(
11901-			                             o3,
11902-			                             output3 +
11903-			                                 (3 * stbir__simdfX_float_count));
11904-			                         stbir__simdfX_madd(o0, o0, r0, c3);
11905-			                         stbir__simdfX_madd(o1, o1, r1, c3);
11906-			                         stbir__simdfX_madd(o2, o2, r2, c3);
11907-			                         stbir__simdfX_madd(o3, o3, r3, c3);
11908-			                         stbir__simdfX_store(output3, o0);
11909-			                         stbir__simdfX_store(
11910-			                             output3 + stbir__simdfX_float_count,
11911-			                             o1);
11912-			                         stbir__simdfX_store(
11913-			                             output3 +
11914-			                                 (2 * stbir__simdfX_float_count),
11915-			                             o2);
11916-			                         stbir__simdfX_store(
11917-			                             output3 +
11918-			                                 (3 * stbir__simdfX_float_count),
11919-			                             o3);)
11920-			        stbIF4(stbir__simdfX_load(o0, output4); stbir__simdfX_load(
11921-			                   o1, output4 + stbir__simdfX_float_count);
11922-			               stbir__simdfX_load(
11923-			                   o2, output4 + (2 * stbir__simdfX_float_count));
11924-			               stbir__simdfX_load(
11925-			                   o3, output4 + (3 * stbir__simdfX_float_count));
11926-			               stbir__simdfX_madd(o0, o0, r0, c4);
11927-			               stbir__simdfX_madd(o1, o1, r1, c4);
11928-			               stbir__simdfX_madd(o2, o2, r2, c4);
11929-			               stbir__simdfX_madd(o3, o3, r3, c4);
11930-			               stbir__simdfX_store(output4, o0);
11931-			               stbir__simdfX_store(
11932-			                   output4 + stbir__simdfX_float_count, o1);
11933-			               stbir__simdfX_store(
11934-			                   output4 + (2 * stbir__simdfX_float_count), o2);
11935-			               stbir__simdfX_store(
11936-			                   output4 + (3 * stbir__simdfX_float_count), o3);)
11937-			            stbIF5(
11938-			                stbir__simdfX_load(o0, output5); stbir__simdfX_load(
11939-			                    o1, output5 + stbir__simdfX_float_count);
11940-			                stbir__simdfX_load(
11941-			                    o2, output5 + (2 * stbir__simdfX_float_count));
11942-			                stbir__simdfX_load(
11943-			                    o3, output5 + (3 * stbir__simdfX_float_count));
11944-			                stbir__simdfX_madd(o0, o0, r0, c5);
11945-			                stbir__simdfX_madd(o1, o1, r1, c5);
11946-			                stbir__simdfX_madd(o2, o2, r2, c5);
11947-			                stbir__simdfX_madd(o3, o3, r3, c5);
11948-			                stbir__simdfX_store(output5, o0);
11949-			                stbir__simdfX_store(
11950-			                    output5 + stbir__simdfX_float_count, o1);
11951-			                stbir__simdfX_store(
11952-			                    output5 + (2 * stbir__simdfX_float_count), o2);
11953-			                stbir__simdfX_store(
11954-			                    output5 + (3 * stbir__simdfX_float_count), o3);)
11955-			                stbIF6(
11956-			                    stbir__simdfX_load(o0, output6);
11957-			                    stbir__simdfX_load(
11958-			                        o1, output6 + stbir__simdfX_float_count);
11959-			                    stbir__simdfX_load(
11960-			                        o2,
11961-			                        output6 + (2 * stbir__simdfX_float_count));
11962-			                    stbir__simdfX_load(
11963-			                        o3,
11964-			                        output6 + (3 * stbir__simdfX_float_count));
11965-			                    stbir__simdfX_madd(o0, o0, r0, c6);
11966-			                    stbir__simdfX_madd(o1, o1, r1, c6);
11967-			                    stbir__simdfX_madd(o2, o2, r2, c6);
11968-			                    stbir__simdfX_madd(o3, o3, r3, c6);
11969-			                    stbir__simdfX_store(output6, o0);
11970-			                    stbir__simdfX_store(
11971-			                        output6 + stbir__simdfX_float_count, o1);
11972-			                    stbir__simdfX_store(
11973-			                        output6 + (2 * stbir__simdfX_float_count),
11974-			                        o2);
11975-			                    stbir__simdfX_store(
11976-			                        output6 + (3 * stbir__simdfX_float_count),
11977-			                        o3);)
11978-			                    stbIF7(stbir__simdfX_load(o0, output7);
11979-			                           stbir__simdfX_load(
11980-			                               o1,
11981-			                               output7 + stbir__simdfX_float_count);
11982-			                           stbir__simdfX_load(
11983-			                               o2,
11984-			                               output7 +
11985-			                                   (2 * stbir__simdfX_float_count));
11986-			                           stbir__simdfX_load(
11987-			                               o3,
11988-			                               output7 +
11989-			                                   (3 * stbir__simdfX_float_count));
11990-			                           stbir__simdfX_madd(o0, o0, r0, c7);
11991-			                           stbir__simdfX_madd(o1, o1, r1, c7);
11992-			                           stbir__simdfX_madd(o2, o2, r2, c7);
11993-			                           stbir__simdfX_madd(o3, o3, r3, c7);
11994-			                           stbir__simdfX_store(output7, o0);
11995-			                           stbir__simdfX_store(
11996-			                               output7 + stbir__simdfX_float_count,
11997-			                               o1);
11998-			                           stbir__simdfX_store(
11999-			                               output7 +
12000-			                                   (2 * stbir__simdfX_float_count),
12001-			                               o2);
12002-			                           stbir__simdfX_store(
12003-			                               output7 +
12004-			                                   (3 * stbir__simdfX_float_count),
12005-			                               o3);)
12006-#else
12007-			stbIF0(
12008-			    stbir__simdfX_mult(o0, r0, c0); stbir__simdfX_mult(o1, r1, c0);
12009-			    stbir__simdfX_mult(o2, r2, c0);
12010-			    stbir__simdfX_mult(o3, r3, c0);
12011-			    stbir__simdfX_store(output0, o0);
12012-			    stbir__simdfX_store(output0 + stbir__simdfX_float_count, o1);
12013-			    stbir__simdfX_store(output0 + (2 * stbir__simdfX_float_count),
12014-			                        o2);
12015-			    stbir__simdfX_store(
12016-			        output0 + (3 * stbir__simdfX_float_count),
12017-			        o3);) stbIF1(stbir__simdfX_mult(o0, r0, c1);
12018-			                     stbir__simdfX_mult(o1, r1, c1);
12019-			                     stbir__simdfX_mult(o2, r2, c1);
12020-			                     stbir__simdfX_mult(o3, r3, c1);
12021-			                     stbir__simdfX_store(output1, o0);
12022-			                     stbir__simdfX_store(
12023-			                         output1 + stbir__simdfX_float_count, o1);
12024-			                     stbir__simdfX_store(
12025-			                         output1 + (2 * stbir__simdfX_float_count),
12026-			                         o2);
12027-			                     stbir__simdfX_store(
12028-			                         output1 + (3 * stbir__simdfX_float_count),
12029-			                         o3);)
12030-			    stbIF2(stbir__simdfX_mult(o0, r0, c2);
12031-			           stbir__simdfX_mult(o1, r1, c2);
12032-			           stbir__simdfX_mult(o2, r2, c2);
12033-			           stbir__simdfX_mult(o3, r3, c2);
12034-			           stbir__simdfX_store(output2, o0);
12035-			           stbir__simdfX_store(output2 + stbir__simdfX_float_count,
12036-			                               o1);
12037-			           stbir__simdfX_store(
12038-			               output2 + (2 * stbir__simdfX_float_count), o2);
12039-			           stbir__simdfX_store(
12040-			               output2 + (3 * stbir__simdfX_float_count),
12041-			               o3);) stbIF3(stbir__simdfX_mult(o0, r0, c3);
12042-			                            stbir__simdfX_mult(o1, r1, c3);
12043-			                            stbir__simdfX_mult(o2, r2, c3);
12044-			                            stbir__simdfX_mult(o3, r3, c3);
12045-			                            stbir__simdfX_store(output3, o0);
12046-			                            stbir__simdfX_store(
12047-			                                output3 + stbir__simdfX_float_count,
12048-			                                o1);
12049-			                            stbir__simdfX_store(
12050-			                                output3 +
12051-			                                    (2 * stbir__simdfX_float_count),
12052-			                                o2);
12053-			                            stbir__simdfX_store(
12054-			                                output3 +
12055-			                                    (3 * stbir__simdfX_float_count),
12056-			                                o3);)
12057-			        stbIF4(stbir__simdfX_mult(o0, r0, c4);
12058-			               stbir__simdfX_mult(o1, r1, c4);
12059-			               stbir__simdfX_mult(o2, r2, c4);
12060-			               stbir__simdfX_mult(o3, r3, c4);
12061-			               stbir__simdfX_store(output4, o0);
12062-			               stbir__simdfX_store(
12063-			                   output4 + stbir__simdfX_float_count, o1);
12064-			               stbir__simdfX_store(
12065-			                   output4 + (2 * stbir__simdfX_float_count), o2);
12066-			               stbir__simdfX_store(
12067-			                   output4 + (3 * stbir__simdfX_float_count), o3);)
12068-			            stbIF5(
12069-			                stbir__simdfX_mult(o0, r0, c5);
12070-			                stbir__simdfX_mult(o1, r1, c5);
12071-			                stbir__simdfX_mult(o2, r2, c5);
12072-			                stbir__simdfX_mult(o3, r3, c5);
12073-			                stbir__simdfX_store(output5, o0);
12074-			                stbir__simdfX_store(
12075-			                    output5 + stbir__simdfX_float_count, o1);
12076-			                stbir__simdfX_store(
12077-			                    output5 + (2 * stbir__simdfX_float_count), o2);
12078-			                stbir__simdfX_store(
12079-			                    output5 + (3 * stbir__simdfX_float_count), o3);)
12080-			                stbIF6(
12081-			                    stbir__simdfX_mult(o0, r0, c6);
12082-			                    stbir__simdfX_mult(o1, r1, c6);
12083-			                    stbir__simdfX_mult(o2, r2, c6);
12084-			                    stbir__simdfX_mult(o3, r3, c6);
12085-			                    stbir__simdfX_store(output6, o0);
12086-			                    stbir__simdfX_store(
12087-			                        output6 + stbir__simdfX_float_count, o1);
12088-			                    stbir__simdfX_store(
12089-			                        output6 + (2 * stbir__simdfX_float_count),
12090-			                        o2);
12091-			                    stbir__simdfX_store(
12092-			                        output6 + (3 * stbir__simdfX_float_count),
12093-			                        o3);)
12094-			                    stbIF7(stbir__simdfX_mult(o0, r0, c7);
12095-			                           stbir__simdfX_mult(o1, r1, c7);
12096-			                           stbir__simdfX_mult(o2, r2, c7);
12097-			                           stbir__simdfX_mult(o3, r3, c7);
12098-			                           stbir__simdfX_store(output7, o0);
12099-			                           stbir__simdfX_store(
12100-			                               output7 + stbir__simdfX_float_count,
12101-			                               o1);
12102-			                           stbir__simdfX_store(
12103-			                               output7 +
12104-			                                   (2 * stbir__simdfX_float_count),
12105-			                               o2);
12106-			                           stbir__simdfX_store(
12107-			                               output7 +
12108-			                                   (3 * stbir__simdfX_float_count),
12109-			                               o3);)
12110-#endif
12111-
12112-			                        input += (4 * stbir__simdfX_float_count);
12113-			stbIF0(output0 += (4 * stbir__simdfX_float_count);) stbIF1(
12114-			    output1 += (4 * stbir__simdfX_float_count);)
12115-			    stbIF2(output2 += (4 * stbir__simdfX_float_count);) stbIF3(
12116-			        output3 += (4 * stbir__simdfX_float_count);)
12117-			        stbIF4(output4 += (4 * stbir__simdfX_float_count);) stbIF5(
12118-			            output5 += (4 * stbir__simdfX_float_count);)
12119-			            stbIF6(output6 += (4 * stbir__simdfX_float_count);)
12120-			                stbIF7(output7 += (4 * stbir__simdfX_float_count);)
12121-		}
12122-		STBIR_SIMD_NO_UNROLL_LOOP_START
12123-		while (((char *)input_end - (char *)input) >= 16) {
12124-			stbir__simdf o0, r0;
12125-			STBIR_SIMD_NO_UNROLL(output0);
12126-
12127-			stbir__simdf_load(r0, input);
12128-
12129-#ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
12130-			stbIF0(stbir__simdf_load(o0, output0); stbir__simdf_madd(
12131-			           o0, o0, r0, stbir__if_simdf8_cast_to_simdf4(c0));
12132-			       stbir__simdf_store(
12133-			           output0,
12134-			           o0);) stbIF1(stbir__simdf_load(o0, output1);
12135-			                        stbir__simdf_madd(
12136-			                            o0,
12137-			                            o0,
12138-			                            r0,
12139-			                            stbir__if_simdf8_cast_to_simdf4(c1));
12140-			                        stbir__simdf_store(output1, o0);)
12141-			    stbIF2(stbir__simdf_load(o0, output2); stbir__simdf_madd(
12142-			               o0, o0, r0, stbir__if_simdf8_cast_to_simdf4(c2));
12143-			           stbir__simdf_store(output2, o0);)
12144-			        stbIF3(stbir__simdf_load(o0, output3); stbir__simdf_madd(
12145-			                   o0, o0, r0, stbir__if_simdf8_cast_to_simdf4(c3));
12146-			               stbir__simdf_store(output3, o0);)
12147-			            stbIF4(stbir__simdf_load(o0, output4);
12148-			                   stbir__simdf_madd(
12149-			                       o0,
12150-			                       o0,
12151-			                       r0,
12152-			                       stbir__if_simdf8_cast_to_simdf4(c4));
12153-			                   stbir__simdf_store(output4, o0);)
12154-			                stbIF5(stbir__simdf_load(o0, output5);
12155-			                       stbir__simdf_madd(
12156-			                           o0,
12157-			                           o0,
12158-			                           r0,
12159-			                           stbir__if_simdf8_cast_to_simdf4(c5));
12160-			                       stbir__simdf_store(output5, o0);)
12161-			                    stbIF6(stbir__simdf_load(o0, output6);
12162-			                           stbir__simdf_madd(
12163-			                               o0,
12164-			                               o0,
12165-			                               r0,
12166-			                               stbir__if_simdf8_cast_to_simdf4(c6));
12167-			                           stbir__simdf_store(output6, o0);)
12168-			                        stbIF7(stbir__simdf_load(o0, output7);
12169-			                               stbir__simdf_madd(
12170-			                                   o0,
12171-			                                   o0,
12172-			                                   r0,
12173-			                                   stbir__if_simdf8_cast_to_simdf4(
12174-			                                       c7));
12175-			                               stbir__simdf_store(output7, o0);)
12176-#else
12177-			stbIF0(
12178-			    stbir__simdf_mult(o0, r0, stbir__if_simdf8_cast_to_simdf4(c0));
12179-			    stbir__simdf_store(output0, o0);)
12180-			    stbIF1(stbir__simdf_mult(
12181-			               o0, r0, stbir__if_simdf8_cast_to_simdf4(c1));
12182-			           stbir__simdf_store(output1, o0);)
12183-			        stbIF2(stbir__simdf_mult(
12184-			                   o0, r0, stbir__if_simdf8_cast_to_simdf4(c2));
12185-			               stbir__simdf_store(output2, o0);)
12186-			            stbIF3(stbir__simdf_mult(
12187-			                       o0, r0, stbir__if_simdf8_cast_to_simdf4(c3));
12188-			                   stbir__simdf_store(output3, o0);)
12189-			                stbIF4(stbir__simdf_mult(
12190-			                           o0,
12191-			                           r0,
12192-			                           stbir__if_simdf8_cast_to_simdf4(c4));
12193-			                       stbir__simdf_store(output4, o0);)
12194-			                    stbIF5(stbir__simdf_mult(
12195-			                               o0,
12196-			                               r0,
12197-			                               stbir__if_simdf8_cast_to_simdf4(c5));
12198-			                           stbir__simdf_store(output5, o0);)
12199-			                        stbIF6(stbir__simdf_mult(
12200-			                                   o0,
12201-			                                   r0,
12202-			                                   stbir__if_simdf8_cast_to_simdf4(
12203-			                                       c6));
12204-			                               stbir__simdf_store(output6, o0);)
12205-			                            stbIF7(
12206-			                                stbir__simdf_mult(
12207-			                                    o0,
12208-			                                    r0,
12209-			                                    stbir__if_simdf8_cast_to_simdf4(
12210-			                                        c7));
12211-			                                stbir__simdf_store(output7, o0);)
12212-#endif
12213-
12214-			                            input += 4;
12215-			stbIF0(output0 += 4;) stbIF1(output1 += 4;) stbIF2(output2 += 4;)
12216-			    stbIF3(output3 += 4;) stbIF4(output4 += 4;)
12217-			        stbIF5(output5 += 4;) stbIF6(output6 += 4;)
12218-			            stbIF7(output7 += 4;)
12219-		}
12220-	}
12221-#else
12222-	                                STBIR_NO_UNROLL_LOOP_START while (
12223-	                                    ((char *)input_end - (char *)input) >=
12224-	                                    16)
12225-	{
12226-		float r0, r1, r2, r3;
12227-		STBIR_NO_UNROLL(input);
12228-
12229-		r0 = input[0], r1 = input[1], r2 = input[2], r3 = input[3];
12230-
12231-#ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
12232-		stbIF0(output0[0] += (r0 * c0s); output0[1] += (r1 * c0s);
12233-		       output0[2] += (r2 * c0s); output0[3] += (r3 * c0s);)
12234-		    stbIF1(output1[0] += (r0 * c1s); output1[1] += (r1 * c1s);
12235-		           output1[2] += (r2 * c1s); output1[3] += (r3 * c1s);)
12236-		        stbIF2(output2[0] += (r0 * c2s); output2[1] += (r1 * c2s);
12237-		               output2[2] += (r2 * c2s); output2[3] += (r3 * c2s);)
12238-		            stbIF3(output3[0] += (r0 * c3s); output3[1] += (r1 * c3s);
12239-		                   output3[2] += (r2 * c3s); output3[3] += (r3 * c3s);)
12240-		                stbIF4(
12241-		                    output4[0] += (r0 * c4s); output4[1] += (r1 * c4s);
12242-		                    output4[2] += (r2 * c4s); output4[3] += (r3 * c4s);)
12243-		                    stbIF5(output5[0] += (r0 * c5s);
12244-		                           output5[1] += (r1 * c5s);
12245-		                           output5[2] += (r2 * c5s);
12246-		                           output5[3] += (r3 * c5s);)
12247-		                        stbIF6(output6[0] += (r0 * c6s);
12248-		                               output6[1] += (r1 * c6s);
12249-		                               output6[2] += (r2 * c6s);
12250-		                               output6[3] += (r3 * c6s);)
12251-		                            stbIF7(output7[0] += (r0 * c7s);
12252-		                                   output7[1] += (r1 * c7s);
12253-		                                   output7[2] += (r2 * c7s);
12254-		                                   output7[3] += (r3 * c7s);)
12255-#else
12256-		stbIF0(output0[0] = (r0 * c0s); output0[1] = (r1 * c0s);
12257-		       output0[2] = (r2 * c0s); output0[3] = (r3 * c0s);)
12258-		    stbIF1(output1[0] = (r0 * c1s); output1[1] = (r1 * c1s);
12259-		           output1[2] = (r2 * c1s); output1[3] = (r3 * c1s);)
12260-		        stbIF2(output2[0] = (r0 * c2s); output2[1] = (r1 * c2s);
12261-		               output2[2] = (r2 * c2s); output2[3] = (r3 * c2s);)
12262-		            stbIF3(output3[0] = (r0 * c3s); output3[1] = (r1 * c3s);
12263-		                   output3[2] = (r2 * c3s); output3[3] = (r3 * c3s);)
12264-		                stbIF4(output4[0] = (r0 * c4s); output4[1] = (r1 * c4s);
12265-		                       output4[2] = (r2 * c4s);
12266-		                       output4[3] = (r3 * c4s);)
12267-		                    stbIF5(output5[0] = (r0 * c5s);
12268-		                           output5[1] = (r1 * c5s);
12269-		                           output5[2] = (r2 * c5s);
12270-		                           output5[3] = (r3 * c5s);)
12271-		                        stbIF6(output6[0] = (r0 * c6s);
12272-		                               output6[1] = (r1 * c6s);
12273-		                               output6[2] = (r2 * c6s);
12274-		                               output6[3] = (r3 * c6s);)
12275-		                            stbIF7(output7[0] = (r0 * c7s);
12276-		                                   output7[1] = (r1 * c7s);
12277-		                                   output7[2] = (r2 * c7s);
12278-		                                   output7[3] = (r3 * c7s);)
12279-#endif
12280-
12281-		input += 4;
12282-		stbIF0(output0 += 4;) stbIF1(output1 += 4;) stbIF2(output2 += 4;)
12283-		    stbIF3(output3 += 4;) stbIF4(output4 += 4;) stbIF5(output5 += 4;)
12284-		        stbIF6(output6 += 4;) stbIF7(output7 += 4;)
12285-	}
12286-#endif
12287-	STBIR_NO_UNROLL_LOOP_START
12288-	while (input < input_end) {
12289-		float r = input[0];
12290-		STBIR_NO_UNROLL(output0);
12291-
12292-#ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
12293-		stbIF0(output0[0] += (r * c0s);) stbIF1(output1[0] += (r * c1s);)
12294-		    stbIF2(output2[0] += (r * c2s);) stbIF3(output3[0] += (r * c3s);)
12295-		        stbIF4(output4[0] += (r * c4s);)
12296-		            stbIF5(output5[0] += (r * c5s);)
12297-		                stbIF6(output6[0] += (r * c6s);)
12298-		                    stbIF7(output7[0] += (r * c7s);)
12299-#else
12300-		stbIF0(output0[0] = (r * c0s);) stbIF1(output1[0] = (r * c1s);)
12301-		    stbIF2(output2[0] = (r * c2s);) stbIF3(output3[0] = (r * c3s);)
12302-		        stbIF4(output4[0] = (r * c4s);) stbIF5(output5[0] = (r * c5s);)
12303-		            stbIF6(output6[0] = (r * c6s);)
12304-		                stbIF7(output7[0] = (r * c7s);)
12305-#endif
12306-
12307-		++input;
12308-		stbIF0(++output0;) stbIF1(++output1;) stbIF2(++output2;)
12309-		    stbIF3(++output3;) stbIF4(++output4;) stbIF5(++output5;)
12310-		        stbIF6(++output6;) stbIF7(++output7;)
12311-	}
12312-}
12313-
12314-static void
12315-STBIR_chans(stbir__vertical_gather_with_,
12316-            _coeffs)(float *outputp,
12317-                     float const *vertical_coefficients,
12318-                     float const **inputs,
12319-                     float const *input0_end)
12320-{
12321-	float STBIR_SIMD_STREAMOUT_PTR(*) output = outputp;
12322-
12323-	stbIF0(float const *input0 = inputs[0];
12324-	       float c0s = vertical_coefficients[0];)
12325-	    stbIF1(float const *input1 = inputs[1];
12326-	           float c1s = vertical_coefficients[1];)
12327-	        stbIF2(float const *input2 = inputs[2];
12328-	               float c2s = vertical_coefficients[2];)
12329-	            stbIF3(float const *input3 = inputs[3];
12330-	                   float c3s = vertical_coefficients[3];)
12331-	                stbIF4(float const *input4 = inputs[4];
12332-	                       float c4s = vertical_coefficients[4];)
12333-	                    stbIF5(float const *input5 = inputs[5];
12334-	                           float c5s = vertical_coefficients[5];)
12335-	                        stbIF6(float const *input6 = inputs[6];
12336-	                               float c6s = vertical_coefficients[6];)
12337-	                            stbIF7(float const *input7 = inputs[7];
12338-	                                   float c7s = vertical_coefficients[7];)
12339-
12340-#if (STBIR__vertical_channels == 1) &&                                         \
12341-    !defined(STB_IMAGE_RESIZE_VERTICAL_CONTINUE)
12342-	    // check single channel one weight
12343-	    if ((c0s >= (1.0f - 0.000001f)) && (c0s <= (1.0f + 0.000001f)))
12344-	{
12345-		STBIR_MEMCPY(output, input0, (char *)input0_end - (char *)input0);
12346-		return;
12347-	}
12348-#endif
12349-
12350-#ifdef STBIR_SIMD
12351-	{
12352-		stbIF0(stbir__simdfX c0 = stbir__simdf_frepX(c0s);)
12353-		    stbIF1(stbir__simdfX c1 = stbir__simdf_frepX(c1s);)
12354-		        stbIF2(stbir__simdfX c2 = stbir__simdf_frepX(c2s);) stbIF3(
12355-		            stbir__simdfX c3 = stbir__simdf_frepX(c3s);)
12356-		            stbIF4(stbir__simdfX c4 = stbir__simdf_frepX(c4s);) stbIF5(
12357-		                stbir__simdfX c5 = stbir__simdf_frepX(c5s);)
12358-		                stbIF6(stbir__simdfX c6 = stbir__simdf_frepX(c6s);)
12359-		                    stbIF7(stbir__simdfX c7 = stbir__simdf_frepX(c7s);)
12360-
12361-		                        STBIR_SIMD_NO_UNROLL_LOOP_START while (
12362-		                            ((char *)input0_end - (char *)input0) >=
12363-		                            (16 * stbir__simdfX_float_count))
12364-		{
12365-			stbir__simdfX o0, o1, o2, o3, r0, r1, r2, r3;
12366-			STBIR_SIMD_NO_UNROLL(output);
12367-
12368-			// prefetch four loop iterations ahead (doesn't affect much for
12369-			// small resizes, but helps with big ones)
12370-			stbIF0(stbir__prefetch(input0 + (16 * stbir__simdfX_float_count));) stbIF1(
12371-			    stbir__prefetch(input1 + (16 * stbir__simdfX_float_count));)
12372-			    stbIF2(stbir__prefetch(input2 + (16 * stbir__simdfX_float_count));) stbIF3(
12373-			        stbir__prefetch(input3 + (16 * stbir__simdfX_float_count));)
12374-			        stbIF4(stbir__prefetch(input4 + (16 * stbir__simdfX_float_count));) stbIF5(
12375-			            stbir__prefetch(input5 +
12376-			                            (16 * stbir__simdfX_float_count));)
12377-			            stbIF6(stbir__prefetch(input6 + (16 * stbir__simdfX_float_count));) stbIF7(
12378-			                stbir__prefetch(input7 +
12379-			                                (16 * stbir__simdfX_float_count));)
12380-
12381-#ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
12382-			                stbIF0(
12383-			                    stbir__simdfX_load(o0, output);
12384-			                    stbir__simdfX_load(
12385-			                        o1, output + stbir__simdfX_float_count);
12386-			                    stbir__simdfX_load(
12387-			                        o2,
12388-			                        output + (2 * stbir__simdfX_float_count));
12389-			                    stbir__simdfX_load(
12390-			                        o3,
12391-			                        output + (3 * stbir__simdfX_float_count));
12392-			                    stbir__simdfX_load(r0, input0);
12393-			                    stbir__simdfX_load(
12394-			                        r1, input0 + stbir__simdfX_float_count);
12395-			                    stbir__simdfX_load(
12396-			                        r2,
12397-			                        input0 + (2 * stbir__simdfX_float_count));
12398-			                    stbir__simdfX_load(
12399-			                        r3,
12400-			                        input0 + (3 * stbir__simdfX_float_count));
12401-			                    stbir__simdfX_madd(o0, o0, r0, c0);
12402-			                    stbir__simdfX_madd(o1, o1, r1, c0);
12403-			                    stbir__simdfX_madd(o2, o2, r2, c0);
12404-			                    stbir__simdfX_madd(o3, o3, r3, c0);)
12405-#else
12406-			                stbIF0(
12407-			                    stbir__simdfX_load(r0, input0);
12408-			                    stbir__simdfX_load(
12409-			                        r1, input0 + stbir__simdfX_float_count);
12410-			                    stbir__simdfX_load(
12411-			                        r2,
12412-			                        input0 + (2 * stbir__simdfX_float_count));
12413-			                    stbir__simdfX_load(
12414-			                        r3,
12415-			                        input0 + (3 * stbir__simdfX_float_count));
12416-			                    stbir__simdfX_mult(o0, r0, c0);
12417-			                    stbir__simdfX_mult(o1, r1, c0);
12418-			                    stbir__simdfX_mult(o2, r2, c0);
12419-			                    stbir__simdfX_mult(o3, r3, c0);)
12420-#endif
12421-
12422-			                    stbIF1(
12423-			                        stbir__simdfX_load(r0, input1);
12424-			                        stbir__simdfX_load(
12425-			                            r1, input1 + stbir__simdfX_float_count);
12426-			                        stbir__simdfX_load(
12427-			                            r2,
12428-			                            input1 +
12429-			                                (2 * stbir__simdfX_float_count));
12430-			                        stbir__simdfX_load(
12431-			                            r3,
12432-			                            input1 +
12433-			                                (3 * stbir__simdfX_float_count));
12434-			                        stbir__simdfX_madd(o0, o0, r0, c1);
12435-			                        stbir__simdfX_madd(o1, o1, r1, c1);
12436-			                        stbir__simdfX_madd(o2, o2, r2, c1);
12437-			                        stbir__simdfX_madd(
12438-			                            o3,
12439-			                            o3,
12440-			                            r3,
12441-			                            c1);) stbIF2(stbir__simdfX_load(r0,
12442-			                                                            input2);
12443-			                                         stbir__simdfX_load(
12444-			                                             r1,
12445-			                                             input2 +
12446-			                                                 stbir__simdfX_float_count);
12447-			                                         stbir__simdfX_load(
12448-			                                             r2,
12449-			                                             input2 +
12450-			                                                 (2 *
12451-			                                                  stbir__simdfX_float_count));
12452-			                                         stbir__simdfX_load(
12453-			                                             r3,
12454-			                                             input2 +
12455-			                                                 (3 *
12456-			                                                  stbir__simdfX_float_count));
12457-			                                         stbir__simdfX_madd(
12458-			                                             o0, o0, r0, c2);
12459-			                                         stbir__simdfX_madd(
12460-			                                             o1, o1, r1, c2);
12461-			                                         stbir__simdfX_madd(
12462-			                                             o2, o2, r2, c2);
12463-			                                         stbir__simdfX_madd(
12464-			                                             o3, o3, r3, c2);)
12465-			                        stbIF3(
12466-			                            stbir__simdfX_load(r0, input3);
12467-			                            stbir__simdfX_load(
12468-			                                r1,
12469-			                                input3 + stbir__simdfX_float_count);
12470-			                            stbir__simdfX_load(
12471-			                                r2,
12472-			                                input3 +
12473-			                                    (2 *
12474-			                                     stbir__simdfX_float_count));
12475-			                            stbir__simdfX_load(
12476-			                                r3,
12477-			                                input3 +
12478-			                                    (3 *
12479-			                                     stbir__simdfX_float_count));
12480-			                            stbir__simdfX_madd(o0, o0, r0, c3);
12481-			                            stbir__simdfX_madd(o1, o1, r1, c3);
12482-			                            stbir__simdfX_madd(o2, o2, r2, c3);
12483-			                            stbir__simdfX_madd(o3, o3, r3, c3);)
12484-			                            stbIF4(
12485-			                                stbir__simdfX_load(r0, input4);
12486-			                                stbir__simdfX_load(
12487-			                                    r1,
12488-			                                    input4 +
12489-			                                        stbir__simdfX_float_count);
12490-			                                stbir__simdfX_load(
12491-			                                    r2,
12492-			                                    input4 +
12493-			                                        (2 *
12494-			                                         stbir__simdfX_float_count));
12495-			                                stbir__simdfX_load(
12496-			                                    r3,
12497-			                                    input4 +
12498-			                                        (3 *
12499-			                                         stbir__simdfX_float_count));
12500-			                                stbir__simdfX_madd(o0, o0, r0, c4);
12501-			                                stbir__simdfX_madd(o1, o1, r1, c4);
12502-			                                stbir__simdfX_madd(o2, o2, r2, c4);
12503-			                                stbir__simdfX_madd(o3, o3, r3, c4);)
12504-			                                stbIF5(
12505-			                                    stbir__simdfX_load(r0, input5);
12506-			                                    stbir__simdfX_load(
12507-			                                        r1,
12508-			                                        input5 +
12509-			                                            stbir__simdfX_float_count);
12510-			                                    stbir__simdfX_load(
12511-			                                        r2,
12512-			                                        input5 +
12513-			                                            (2 *
12514-			                                             stbir__simdfX_float_count));
12515-			                                    stbir__simdfX_load(
12516-			                                        r3,
12517-			                                        input5 +
12518-			                                            (3 *
12519-			                                             stbir__simdfX_float_count));
12520-			                                    stbir__simdfX_madd(
12521-			                                        o0, o0, r0, c5);
12522-			                                    stbir__simdfX_madd(
12523-			                                        o1, o1, r1, c5);
12524-			                                    stbir__simdfX_madd(
12525-			                                        o2, o2, r2, c5);
12526-			                                    stbir__simdfX_madd(
12527-			                                        o3, o3, r3, c5);)
12528-			                                    stbIF6(
12529-			                                        stbir__simdfX_load(r0,
12530-			                                                           input6);
12531-			                                        stbir__simdfX_load(
12532-			                                            r1,
12533-			                                            input6 +
12534-			                                                stbir__simdfX_float_count);
12535-			                                        stbir__simdfX_load(
12536-			                                            r2,
12537-			                                            input6 +
12538-			                                                (2 *
12539-			                                                 stbir__simdfX_float_count));
12540-			                                        stbir__simdfX_load(
12541-			                                            r3,
12542-			                                            input6 +
12543-			                                                (3 *
12544-			                                                 stbir__simdfX_float_count));
12545-			                                        stbir__simdfX_madd(
12546-			                                            o0, o0, r0, c6);
12547-			                                        stbir__simdfX_madd(
12548-			                                            o1, o1, r1, c6);
12549-			                                        stbir__simdfX_madd(
12550-			                                            o2, o2, r2, c6);
12551-			                                        stbir__simdfX_madd(
12552-			                                            o3, o3, r3, c6);)
12553-			                                        stbIF7(
12554-			                                            stbir__simdfX_load(
12555-			                                                r0, input7);
12556-			                                            stbir__simdfX_load(
12557-			                                                r1,
12558-			                                                input7 +
12559-			                                                    stbir__simdfX_float_count);
12560-			                                            stbir__simdfX_load(
12561-			                                                r2,
12562-			                                                input7 +
12563-			                                                    (2 *
12564-			                                                     stbir__simdfX_float_count));
12565-			                                            stbir__simdfX_load(
12566-			                                                r3,
12567-			                                                input7 +
12568-			                                                    (3 *
12569-			                                                     stbir__simdfX_float_count));
12570-			                                            stbir__simdfX_madd(
12571-			                                                o0, o0, r0, c7);
12572-			                                            stbir__simdfX_madd(
12573-			                                                o1, o1, r1, c7);
12574-			                                            stbir__simdfX_madd(
12575-			                                                o2, o2, r2, c7);
12576-			                                            stbir__simdfX_madd(
12577-			                                                o3, o3, r3, c7);)
12578-
12579-			                                            stbir__simdfX_store(
12580-			                                                output, o0);
12581-			stbir__simdfX_store(output + stbir__simdfX_float_count, o1);
12582-			stbir__simdfX_store(output + (2 * stbir__simdfX_float_count), o2);
12583-			stbir__simdfX_store(output + (3 * stbir__simdfX_float_count), o3);
12584-			output += (4 * stbir__simdfX_float_count);
12585-			stbIF0(input0 += (4 * stbir__simdfX_float_count);) stbIF1(
12586-			    input1 += (4 * stbir__simdfX_float_count);)
12587-			    stbIF2(input2 += (4 * stbir__simdfX_float_count);) stbIF3(
12588-			        input3 += (4 * stbir__simdfX_float_count);)
12589-			        stbIF4(input4 += (4 * stbir__simdfX_float_count);) stbIF5(
12590-			            input5 += (4 * stbir__simdfX_float_count);)
12591-			            stbIF6(input6 += (4 * stbir__simdfX_float_count);)
12592-			                stbIF7(input7 += (4 * stbir__simdfX_float_count);)
12593-		}
12594-
12595-		STBIR_SIMD_NO_UNROLL_LOOP_START
12596-		while (((char *)input0_end - (char *)input0) >= 16) {
12597-			stbir__simdf o0, r0;
12598-			STBIR_SIMD_NO_UNROLL(output);
12599-
12600-#ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
12601-			stbIF0(stbir__simdf_load(o0, output); stbir__simdf_load(r0, input0);
12602-			       stbir__simdf_madd(
12603-			           o0, o0, r0, stbir__if_simdf8_cast_to_simdf4(c0));)
12604-#else
12605-			stbIF0(stbir__simdf_load(r0, input0); stbir__simdf_mult(
12606-			           o0, r0, stbir__if_simdf8_cast_to_simdf4(c0));)
12607-#endif
12608-			    stbIF1(stbir__simdf_load(r0, input1); stbir__simdf_madd(
12609-			               o0, o0, r0, stbir__if_simdf8_cast_to_simdf4(c1));)
12610-			        stbIF2(
12611-			            stbir__simdf_load(r0, input2); stbir__simdf_madd(
12612-			                o0, o0, r0, stbir__if_simdf8_cast_to_simdf4(c2));)
12613-			            stbIF3(stbir__simdf_load(r0, input3); stbir__simdf_madd(
12614-			                       o0,
12615-			                       o0,
12616-			                       r0,
12617-			                       stbir__if_simdf8_cast_to_simdf4(c3));)
12618-			                stbIF4(stbir__simdf_load(r0, input4);
12619-			                       stbir__simdf_madd(
12620-			                           o0,
12621-			                           o0,
12622-			                           r0,
12623-			                           stbir__if_simdf8_cast_to_simdf4(c4));)
12624-			                    stbIF5(
12625-			                        stbir__simdf_load(r0, input5);
12626-			                        stbir__simdf_madd(
12627-			                            o0,
12628-			                            o0,
12629-			                            r0,
12630-			                            stbir__if_simdf8_cast_to_simdf4(c5));)
12631-			                        stbIF6(stbir__simdf_load(r0, input6);
12632-			                               stbir__simdf_madd(
12633-			                                   o0,
12634-			                                   o0,
12635-			                                   r0,
12636-			                                   stbir__if_simdf8_cast_to_simdf4(
12637-			                                       c6));)
12638-			                            stbIF7(
12639-			                                stbir__simdf_load(r0, input7);
12640-			                                stbir__simdf_madd(
12641-			                                    o0,
12642-			                                    o0,
12643-			                                    r0,
12644-			                                    stbir__if_simdf8_cast_to_simdf4(
12645-			                                        c7));)
12646-
12647-			stbir__simdf_store(output, o0);
12648-			output += 4;
12649-			stbIF0(input0 += 4;) stbIF1(input1 += 4;) stbIF2(input2 += 4;)
12650-			    stbIF3(input3 += 4;) stbIF4(input4 += 4;) stbIF5(input5 += 4;)
12651-			        stbIF6(input6 += 4;) stbIF7(input7 += 4;)
12652-		}
12653-	}
12654-#else
12655-	STBIR_NO_UNROLL_LOOP_START
12656-	while (((char *)input0_end - (char *)input0) >=
12657-	       16)
12658-	{
12659-		float o0, o1, o2, o3;
12660-		STBIR_NO_UNROLL(output);
12661-#ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
12662-		stbIF0(
12663-		    o0 = output[0] + input0[0] * c0s; o1 = output[1] + input0[1] * c0s;
12664-		    o2 = output[2] + input0[2] * c0s; o3 = output[3] + input0[3] * c0s;)
12665-#else
12666-		stbIF0(o0 = input0[0] * c0s; o1 = input0[1] * c0s; o2 = input0[2] * c0s;
12667-		       o3 = input0[3] * c0s;)
12668-#endif
12669-		    stbIF1(o0 += input1[0] * c1s; o1 += input1[1] * c1s;
12670-		           o2 += input1[2] * c1s; o3 += input1[3] * c1s;)
12671-		        stbIF2(o0 += input2[0] * c2s; o1 += input2[1] * c2s;
12672-		               o2 += input2[2] * c2s;
12673-		               o3 += input2[3] * c2s;) stbIF3(o0 += input3[0] * c3s;
12674-		                                              o1 += input3[1] * c3s;
12675-		                                              o2 += input3[2] * c3s;
12676-		                                              o3 += input3[3] * c3s;)
12677-		            stbIF4(o0 += input4[0] * c4s; o1 += input4[1] * c4s;
12678-		                   o2 += input4[2] * c4s; o3 += input4[3] * c4s;)
12679-		                stbIF5(o0 += input5[0] * c5s; o1 += input5[1] * c5s;
12680-		                       o2 += input5[2] * c5s; o3 += input5[3] * c5s;)
12681-		                    stbIF6(o0 += input6[0] * c6s; o1 += input6[1] * c6s;
12682-		                           o2 += input6[2] * c6s;
12683-		                           o3 += input6[3] * c6s;)
12684-		                        stbIF7(o0 += input7[0] * c7s;
12685-		                               o1 += input7[1] * c7s;
12686-		                               o2 += input7[2] * c7s;
12687-		                               o3 += input7[3] * c7s;) output[0] = o0;
12688-		output[1] = o1;
12689-		output[2] = o2;
12690-		output[3] = o3;
12691-		output += 4;
12692-		stbIF0(input0 += 4;) stbIF1(input1 += 4;) stbIF2(input2 += 4;)
12693-		    stbIF3(input3 += 4;) stbIF4(input4 += 4;) stbIF5(input5 += 4;)
12694-		        stbIF6(input6 += 4;) stbIF7(input7 += 4;)
12695-	}
12696-#endif
12697-	STBIR_NO_UNROLL_LOOP_START
12698-	while (input0 < input0_end) {
12699-		float o0;
12700-		STBIR_NO_UNROLL(output);
12701-#ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
12702-		stbIF0(o0 = output[0] + input0[0] * c0s;)
12703-#else
12704-		stbIF0(o0 = input0[0] * c0s;)
12705-#endif
12706-		    stbIF1(o0 += input1[0] * c1s;) stbIF2(o0 += input2[0] * c2s;)
12707-		        stbIF3(o0 += input3[0] * c3s;) stbIF4(o0 += input4[0] * c4s;)
12708-		            stbIF5(o0 += input5[0] * c5s;)
12709-		                stbIF6(o0 += input6[0] * c6s;)
12710-		                    stbIF7(o0 += input7[0] * c7s;) output[0] = o0;
12711-		++output;
12712-		stbIF0(++input0;) stbIF1(++input1;) stbIF2(++input2;) stbIF3(++input3;)
12713-		    stbIF4(++input4;) stbIF5(++input5;) stbIF6(++input6;)
12714-		        stbIF7(++input7;)
12715-	}
12716-}
12717-
12718-#undef stbIF0
12719-#undef stbIF1
12720-#undef stbIF2
12721-#undef stbIF3
12722-#undef stbIF4
12723-#undef stbIF5
12724-#undef stbIF6
12725-#undef stbIF7
12726-#undef STB_IMAGE_RESIZE_DO_VERTICALS
12727-#undef STBIR__vertical_channels
12728-#undef STB_IMAGE_RESIZE_DO_HORIZONTALS
12729-#undef STBIR_strs_join24
12730-#undef STBIR_strs_join14
12731-#undef STBIR_chans
12732-#ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
12733-#undef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
12734-#endif
12735-
12736-#else // !STB_IMAGE_RESIZE_DO_VERTICALS
12737-
12738-#define STBIR_chans(start, end)                                                \
12739-	STBIR_strs_join1(start, STBIR__horizontal_channels, end)
12740-
12741-#ifndef stbir__2_coeff_only
12742-#define stbir__2_coeff_only()                                                  \
12743-	stbir__1_coeff_only();                                                     \
12744-	stbir__1_coeff_remnant(1);
12745-#endif
12746-
12747-#ifndef stbir__2_coeff_remnant
12748-#define stbir__2_coeff_remnant(ofs)                                            \
12749-	stbir__1_coeff_remnant(ofs);                                               \
12750-	stbir__1_coeff_remnant((ofs) + 1);
12751-#endif
12752-
12753-#ifndef stbir__3_coeff_only
12754-#define stbir__3_coeff_only()                                                  \
12755-	stbir__2_coeff_only();                                                     \
12756-	stbir__1_coeff_remnant(2);
12757-#endif
12758-
12759-#ifndef stbir__3_coeff_remnant
12760-#define stbir__3_coeff_remnant(ofs)                                            \
12761-	stbir__2_coeff_remnant(ofs);                                               \
12762-	stbir__1_coeff_remnant((ofs) + 2);
12763-#endif
12764-
12765-#ifndef stbir__3_coeff_setup
12766-#define stbir__3_coeff_setup()
12767-#endif
12768-
12769-#ifndef stbir__4_coeff_start
12770-#define stbir__4_coeff_start()                                                 \
12771-	stbir__2_coeff_only();                                                     \
12772-	stbir__2_coeff_remnant(2);
12773-#endif
12774-
12775-#ifndef stbir__4_coeff_continue_from_4
12776-#define stbir__4_coeff_continue_from_4(ofs)                                    \
12777-	stbir__2_coeff_remnant(ofs);                                               \
12778-	stbir__2_coeff_remnant((ofs) + 2);
12779-#endif
12780-
12781-#ifndef stbir__store_output_tiny
12782-#define stbir__store_output_tiny stbir__store_output
12783-#endif
12784-
12785-static void
12786-STBIR_chans(stbir__horizontal_gather_, _channels_with_1_coeff)(
12787-    float *output_buffer, unsigned int output_sub_size,
12788-    float const *decode_buffer,
12789-    stbir__contributors const *horizontal_contributors,
12790-    float const *horizontal_coefficients, int coefficient_width)
12791-{
12792-	float const *output_end =
12793-	    output_buffer + output_sub_size * STBIR__horizontal_channels;
12794-	float STBIR_SIMD_STREAMOUT_PTR(*) output = output_buffer;
12795-	STBIR_SIMD_NO_UNROLL_LOOP_START
12796-	do {
12797-		float const *decode = decode_buffer + horizontal_contributors->n0 *
12798-		                                          STBIR__horizontal_channels;
12799-		float const *hc = horizontal_coefficients;
12800-		stbir__1_coeff_only();
12801-		stbir__store_output_tiny();
12802-	} while (output < output_end);
12803-}
12804-
12805-static void
12806-STBIR_chans(stbir__horizontal_gather_, _channels_with_2_coeffs)(
12807-    float *output_buffer, unsigned int output_sub_size,
12808-    float const *decode_buffer,
12809-    stbir__contributors const *horizontal_contributors,
12810-    float const *horizontal_coefficients, int coefficient_width)
12811-{
12812-	float const *output_end =
12813-	    output_buffer + output_sub_size * STBIR__horizontal_channels;
12814-	float STBIR_SIMD_STREAMOUT_PTR(*) output = output_buffer;
12815-	STBIR_SIMD_NO_UNROLL_LOOP_START
12816-	do {
12817-		float const *decode = decode_buffer + horizontal_contributors->n0 *
12818-		                                          STBIR__horizontal_channels;
12819-		float const *hc = horizontal_coefficients;
12820-		stbir__2_coeff_only();
12821-		stbir__store_output_tiny();
12822-	} while (output < output_end);
12823-}
12824-
12825-static void
12826-STBIR_chans(stbir__horizontal_gather_, _channels_with_3_coeffs)(
12827-    float *output_buffer, unsigned int output_sub_size,
12828-    float const *decode_buffer,
12829-    stbir__contributors const *horizontal_contributors,
12830-    float const *horizontal_coefficients, int coefficient_width)
12831-{
12832-	float const *output_end =
12833-	    output_buffer + output_sub_size * STBIR__horizontal_channels;
12834-	float STBIR_SIMD_STREAMOUT_PTR(*) output = output_buffer;
12835-	STBIR_SIMD_NO_UNROLL_LOOP_START
12836-	do {
12837-		float const *decode = decode_buffer + horizontal_contributors->n0 *
12838-		                                          STBIR__horizontal_channels;
12839-		float const *hc = horizontal_coefficients;
12840-		stbir__3_coeff_only();
12841-		stbir__store_output_tiny();
12842-	} while (output < output_end);
12843-}
12844-
12845-static void
12846-STBIR_chans(stbir__horizontal_gather_, _channels_with_4_coeffs)(
12847-    float *output_buffer, unsigned int output_sub_size,
12848-    float const *decode_buffer,
12849-    stbir__contributors const *horizontal_contributors,
12850-    float const *horizontal_coefficients, int coefficient_width)
12851-{
12852-	float const *output_end =
12853-	    output_buffer + output_sub_size * STBIR__horizontal_channels;
12854-	float STBIR_SIMD_STREAMOUT_PTR(*) output = output_buffer;
12855-	STBIR_SIMD_NO_UNROLL_LOOP_START
12856-	do {
12857-		float const *decode = decode_buffer + horizontal_contributors->n0 *
12858-		                                          STBIR__horizontal_channels;
12859-		float const *hc = horizontal_coefficients;
12860-		stbir__4_coeff_start();
12861-		stbir__store_output();
12862-	} while (output < output_end);
12863-}
12864-
12865-static void
12866-STBIR_chans(stbir__horizontal_gather_, _channels_with_5_coeffs)(
12867-    float *output_buffer, unsigned int output_sub_size,
12868-    float const *decode_buffer,
12869-    stbir__contributors const *horizontal_contributors,
12870-    float const *horizontal_coefficients, int coefficient_width)
12871-{
12872-	float const *output_end =
12873-	    output_buffer + output_sub_size * STBIR__horizontal_channels;
12874-	float STBIR_SIMD_STREAMOUT_PTR(*) output = output_buffer;
12875-	STBIR_SIMD_NO_UNROLL_LOOP_START
12876-	do {
12877-		float const *decode = decode_buffer + horizontal_contributors->n0 *
12878-		                                          STBIR__horizontal_channels;
12879-		float const *hc = horizontal_coefficients;
12880-		stbir__4_coeff_start();
12881-		stbir__1_coeff_remnant(4);
12882-		stbir__store_output();
12883-	} while (output < output_end);
12884-}
12885-
12886-static void
12887-STBIR_chans(stbir__horizontal_gather_, _channels_with_6_coeffs)(
12888-    float *output_buffer, unsigned int output_sub_size,
12889-    float const *decode_buffer,
12890-    stbir__contributors const *horizontal_contributors,
12891-    float const *horizontal_coefficients, int coefficient_width)
12892-{
12893-	float const *output_end =
12894-	    output_buffer + output_sub_size * STBIR__horizontal_channels;
12895-	float STBIR_SIMD_STREAMOUT_PTR(*) output = output_buffer;
12896-	STBIR_SIMD_NO_UNROLL_LOOP_START
12897-	do {
12898-		float const *decode = decode_buffer + horizontal_contributors->n0 *
12899-		                                          STBIR__horizontal_channels;
12900-		float const *hc = horizontal_coefficients;
12901-		stbir__4_coeff_start();
12902-		stbir__2_coeff_remnant(4);
12903-		stbir__store_output();
12904-	} while (output < output_end);
12905-}
12906-
12907-static void
12908-STBIR_chans(stbir__horizontal_gather_, _channels_with_7_coeffs)(
12909-    float *output_buffer, unsigned int output_sub_size,
12910-    float const *decode_buffer,
12911-    stbir__contributors const *horizontal_contributors,
12912-    float const *horizontal_coefficients, int coefficient_width)
12913-{
12914-	float const *output_end =
12915-	    output_buffer + output_sub_size * STBIR__horizontal_channels;
12916-	float STBIR_SIMD_STREAMOUT_PTR(*) output = output_buffer;
12917-	stbir__3_coeff_setup();
12918-	STBIR_SIMD_NO_UNROLL_LOOP_START
12919-	do {
12920-		float const *decode = decode_buffer + horizontal_contributors->n0 *
12921-		                                          STBIR__horizontal_channels;
12922-		float const *hc = horizontal_coefficients;
12923-
12924-		stbir__4_coeff_start();
12925-		stbir__3_coeff_remnant(4);
12926-		stbir__store_output();
12927-	} while (output < output_end);
12928-}
12929-
12930-static void
12931-STBIR_chans(stbir__horizontal_gather_, _channels_with_8_coeffs)(
12932-    float *output_buffer, unsigned int output_sub_size,
12933-    float const *decode_buffer,
12934-    stbir__contributors const *horizontal_contributors,
12935-    float const *horizontal_coefficients, int coefficient_width)
12936-{
12937-	float const *output_end =
12938-	    output_buffer + output_sub_size * STBIR__horizontal_channels;
12939-	float STBIR_SIMD_STREAMOUT_PTR(*) output = output_buffer;
12940-	STBIR_SIMD_NO_UNROLL_LOOP_START
12941-	do {
12942-		float const *decode = decode_buffer + horizontal_contributors->n0 *
12943-		                                          STBIR__horizontal_channels;
12944-		float const *hc = horizontal_coefficients;
12945-		stbir__4_coeff_start();
12946-		stbir__4_coeff_continue_from_4(4);
12947-		stbir__store_output();
12948-	} while (output < output_end);
12949-}
12950-
12951-static void
12952-STBIR_chans(stbir__horizontal_gather_, _channels_with_9_coeffs)(
12953-    float *output_buffer, unsigned int output_sub_size,
12954-    float const *decode_buffer,
12955-    stbir__contributors const *horizontal_contributors,
12956-    float const *horizontal_coefficients, int coefficient_width)
12957-{
12958-	float const *output_end =
12959-	    output_buffer + output_sub_size * STBIR__horizontal_channels;
12960-	float STBIR_SIMD_STREAMOUT_PTR(*) output = output_buffer;
12961-	STBIR_SIMD_NO_UNROLL_LOOP_START
12962-	do {
12963-		float const *decode = decode_buffer + horizontal_contributors->n0 *
12964-		                                          STBIR__horizontal_channels;
12965-		float const *hc = horizontal_coefficients;
12966-		stbir__4_coeff_start();
12967-		stbir__4_coeff_continue_from_4(4);
12968-		stbir__1_coeff_remnant(8);
12969-		stbir__store_output();
12970-	} while (output < output_end);
12971-}
12972-
12973-static void
12974-STBIR_chans(stbir__horizontal_gather_, _channels_with_10_coeffs)(
12975-    float *output_buffer, unsigned int output_sub_size,
12976-    float const *decode_buffer,
12977-    stbir__contributors const *horizontal_contributors,
12978-    float const *horizontal_coefficients, int coefficient_width)
12979-{
12980-	float const *output_end =
12981-	    output_buffer + output_sub_size * STBIR__horizontal_channels;
12982-	float STBIR_SIMD_STREAMOUT_PTR(*) output = output_buffer;
12983-	STBIR_SIMD_NO_UNROLL_LOOP_START
12984-	do {
12985-		float const *decode = decode_buffer + horizontal_contributors->n0 *
12986-		                                          STBIR__horizontal_channels;
12987-		float const *hc = horizontal_coefficients;
12988-		stbir__4_coeff_start();
12989-		stbir__4_coeff_continue_from_4(4);
12990-		stbir__2_coeff_remnant(8);
12991-		stbir__store_output();
12992-	} while (output < output_end);
12993-}
12994-
12995-static void
12996-STBIR_chans(stbir__horizontal_gather_, _channels_with_11_coeffs)(
12997-    float *output_buffer, unsigned int output_sub_size,
12998-    float const *decode_buffer,
12999-    stbir__contributors const *horizontal_contributors,
13000-    float const *horizontal_coefficients, int coefficient_width)
13001-{
13002-	float const *output_end =
13003-	    output_buffer + output_sub_size * STBIR__horizontal_channels;
13004-	float STBIR_SIMD_STREAMOUT_PTR(*) output = output_buffer;
13005-	stbir__3_coeff_setup();
13006-	STBIR_SIMD_NO_UNROLL_LOOP_START
13007-	do {
13008-		float const *decode = decode_buffer + horizontal_contributors->n0 *
13009-		                                          STBIR__horizontal_channels;
13010-		float const *hc = horizontal_coefficients;
13011-		stbir__4_coeff_start();
13012-		stbir__4_coeff_continue_from_4(4);
13013-		stbir__3_coeff_remnant(8);
13014-		stbir__store_output();
13015-	} while (output < output_end);
13016-}
13017-
13018-static void
13019-STBIR_chans(stbir__horizontal_gather_, _channels_with_12_coeffs)(
13020-    float *output_buffer, unsigned int output_sub_size,
13021-    float const *decode_buffer,
13022-    stbir__contributors const *horizontal_contributors,
13023-    float const *horizontal_coefficients, int coefficient_width)
13024-{
13025-	float const *output_end =
13026-	    output_buffer + output_sub_size * STBIR__horizontal_channels;
13027-	float STBIR_SIMD_STREAMOUT_PTR(*) output = output_buffer;
13028-	STBIR_SIMD_NO_UNROLL_LOOP_START
13029-	do {
13030-		float const *decode = decode_buffer + horizontal_contributors->n0 *
13031-		                                          STBIR__horizontal_channels;
13032-		float const *hc = horizontal_coefficients;
13033-		stbir__4_coeff_start();
13034-		stbir__4_coeff_continue_from_4(4);
13035-		stbir__4_coeff_continue_from_4(8);
13036-		stbir__store_output();
13037-	} while (output < output_end);
13038-}
13039-
13040-static void
13041-STBIR_chans(stbir__horizontal_gather_, _channels_with_n_coeffs_mod0)(
13042-    float *output_buffer, unsigned int output_sub_size,
13043-    float const *decode_buffer,
13044-    stbir__contributors const *horizontal_contributors,
13045-    float const *horizontal_coefficients, int coefficient_width)
13046-{
13047-	float const *output_end =
13048-	    output_buffer + output_sub_size * STBIR__horizontal_channels;
13049-	float STBIR_SIMD_STREAMOUT_PTR(*) output = output_buffer;
13050-	STBIR_SIMD_NO_UNROLL_LOOP_START
13051-	do {
13052-		float const *decode = decode_buffer + horizontal_contributors->n0 *
13053-		                                          STBIR__horizontal_channels;
13054-		int n =
13055-		    ((horizontal_contributors->n1 - horizontal_contributors->n0 + 1) -
13056-		     4 + 3) >>
13057-		    2;
13058-		float const *hc = horizontal_coefficients;
13059-
13060-		stbir__4_coeff_start();
13061-		STBIR_SIMD_NO_UNROLL_LOOP_START
13062-		do {
13063-			hc += 4;
13064-			decode += STBIR__horizontal_channels * 4;
13065-			stbir__4_coeff_continue_from_4(0);
13066-			--n;
13067-		} while (n > 0);
13068-		stbir__store_output();
13069-	} while (output < output_end);
13070-}
13071-
13072-static void
13073-STBIR_chans(stbir__horizontal_gather_, _channels_with_n_coeffs_mod1)(
13074-    float *output_buffer, unsigned int output_sub_size,
13075-    float const *decode_buffer,
13076-    stbir__contributors const *horizontal_contributors,
13077-    float const *horizontal_coefficients, int coefficient_width)
13078-{
13079-	float const *output_end =
13080-	    output_buffer + output_sub_size * STBIR__horizontal_channels;
13081-	float STBIR_SIMD_STREAMOUT_PTR(*) output = output_buffer;
13082-	STBIR_SIMD_NO_UNROLL_LOOP_START
13083-	do {
13084-		float const *decode = decode_buffer + horizontal_contributors->n0 *
13085-		                                          STBIR__horizontal_channels;
13086-		int n =
13087-		    ((horizontal_contributors->n1 - horizontal_contributors->n0 + 1) -
13088-		     5 + 3) >>
13089-		    2;
13090-		float const *hc = horizontal_coefficients;
13091-
13092-		stbir__4_coeff_start();
13093-		STBIR_SIMD_NO_UNROLL_LOOP_START
13094-		do {
13095-			hc += 4;
13096-			decode += STBIR__horizontal_channels * 4;
13097-			stbir__4_coeff_continue_from_4(0);
13098-			--n;
13099-		} while (n > 0);
13100-		stbir__1_coeff_remnant(4);
13101-		stbir__store_output();
13102-	} while (output < output_end);
13103-}
13104-
13105-static void
13106-STBIR_chans(stbir__horizontal_gather_, _channels_with_n_coeffs_mod2)(
13107-    float *output_buffer, unsigned int output_sub_size,
13108-    float const *decode_buffer,
13109-    stbir__contributors const *horizontal_contributors,
13110-    float const *horizontal_coefficients, int coefficient_width)
13111-{
13112-	float const *output_end =
13113-	    output_buffer + output_sub_size * STBIR__horizontal_channels;
13114-	float STBIR_SIMD_STREAMOUT_PTR(*) output = output_buffer;
13115-	STBIR_SIMD_NO_UNROLL_LOOP_START
13116-	do {
13117-		float const *decode = decode_buffer + horizontal_contributors->n0 *
13118-		                                          STBIR__horizontal_channels;
13119-		int n =
13120-		    ((horizontal_contributors->n1 - horizontal_contributors->n0 + 1) -
13121-		     6 + 3) >>
13122-		    2;
13123-		float const *hc = horizontal_coefficients;
13124-
13125-		stbir__4_coeff_start();
13126-		STBIR_SIMD_NO_UNROLL_LOOP_START
13127-		do {
13128-			hc += 4;
13129-			decode += STBIR__horizontal_channels * 4;
13130-			stbir__4_coeff_continue_from_4(0);
13131-			--n;
13132-		} while (n > 0);
13133-		stbir__2_coeff_remnant(4);
13134-
13135-		stbir__store_output();
13136-	} while (output < output_end);
13137-}
13138-
13139-static void
13140-STBIR_chans(stbir__horizontal_gather_, _channels_with_n_coeffs_mod3)(
13141-    float *output_buffer, unsigned int output_sub_size,
13142-    float const *decode_buffer,
13143-    stbir__contributors const *horizontal_contributors,
13144-    float const *horizontal_coefficients, int coefficient_width)
13145-{
13146-	float const *output_end =
13147-	    output_buffer + output_sub_size * STBIR__horizontal_channels;
13148-	float STBIR_SIMD_STREAMOUT_PTR(*) output = output_buffer;
13149-	stbir__3_coeff_setup();
13150-	STBIR_SIMD_NO_UNROLL_LOOP_START
13151-	do {
13152-		float const *decode = decode_buffer + horizontal_contributors->n0 *
13153-		                                          STBIR__horizontal_channels;
13154-		int n =
13155-		    ((horizontal_contributors->n1 - horizontal_contributors->n0 + 1) -
13156-		     7 + 3) >>
13157-		    2;
13158-		float const *hc = horizontal_coefficients;
13159-
13160-		stbir__4_coeff_start();
13161-		STBIR_SIMD_NO_UNROLL_LOOP_START
13162-		do {
13163-			hc += 4;
13164-			decode += STBIR__horizontal_channels * 4;
13165-			stbir__4_coeff_continue_from_4(0);
13166-			--n;
13167-		} while (n > 0);
13168-		stbir__3_coeff_remnant(4);
13169-
13170-		stbir__store_output();
13171-	} while (output < output_end);
13172-}
13173-
13174-static stbir__horizontal_gather_channels_func *
13175-    STBIR_chans(stbir__horizontal_gather_, _channels_with_n_coeffs_funcs)[4] = {
13176-        STBIR_chans(stbir__horizontal_gather_, _channels_with_n_coeffs_mod0),
13177-        STBIR_chans(stbir__horizontal_gather_, _channels_with_n_coeffs_mod1),
13178-        STBIR_chans(stbir__horizontal_gather_, _channels_with_n_coeffs_mod2),
13179-        STBIR_chans(stbir__horizontal_gather_, _channels_with_n_coeffs_mod3),
13180-};
13181-
13182-static stbir__horizontal_gather_channels_func *
13183-    STBIR_chans(stbir__horizontal_gather_, _channels_funcs)[12] = {
13184-        STBIR_chans(stbir__horizontal_gather_, _channels_with_1_coeff),
13185-        STBIR_chans(stbir__horizontal_gather_, _channels_with_2_coeffs),
13186-        STBIR_chans(stbir__horizontal_gather_, _channels_with_3_coeffs),
13187-        STBIR_chans(stbir__horizontal_gather_, _channels_with_4_coeffs),
13188-        STBIR_chans(stbir__horizontal_gather_, _channels_with_5_coeffs),
13189-        STBIR_chans(stbir__horizontal_gather_, _channels_with_6_coeffs),
13190-        STBIR_chans(stbir__horizontal_gather_, _channels_with_7_coeffs),
13191-        STBIR_chans(stbir__horizontal_gather_, _channels_with_8_coeffs),
13192-        STBIR_chans(stbir__horizontal_gather_, _channels_with_9_coeffs),
13193-        STBIR_chans(stbir__horizontal_gather_, _channels_with_10_coeffs),
13194-        STBIR_chans(stbir__horizontal_gather_, _channels_with_11_coeffs),
13195-        STBIR_chans(stbir__horizontal_gather_, _channels_with_12_coeffs),
13196-};
13197-
13198-#undef STBIR__horizontal_channels
13199-#undef STB_IMAGE_RESIZE_DO_HORIZONTALS
13200-#undef stbir__1_coeff_only
13201-#undef stbir__1_coeff_remnant
13202-#undef stbir__2_coeff_only
13203-#undef stbir__2_coeff_remnant
13204-#undef stbir__3_coeff_only
13205-#undef stbir__3_coeff_remnant
13206-#undef stbir__3_coeff_setup
13207-#undef stbir__4_coeff_start
13208-#undef stbir__4_coeff_continue_from_4
13209-#undef stbir__store_output
13210-#undef stbir__store_output_tiny
13211-#undef STBIR_chans
13212-
13213-#endif // HORIZONTALS
13214-
13215-#undef STBIR_strs_join2
13216-#undef STBIR_strs_join1
13217-
13218-#endif // STB_IMAGE_RESIZE_DO_HORIZONTALS/VERTICALS/CODERS
13219-
13220-/*
13221-------------------------------------------------------------------------------
13222-This software is available under 2 licenses -- choose whichever you prefer.
13223-------------------------------------------------------------------------------
13224-ALTERNATIVE A - MIT License
13225-Copyright (c) 2017 Sean Barrett
13226-Permission is hereby granted, free of charge, to any person obtaining a copy of
13227-this software and associated documentation files (the "Software"), to deal in
13228-the Software without restriction, including without limitation the rights to
13229-use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
13230-of the Software, and to permit persons to whom the Software is furnished to do
13231-so, subject to the following conditions:
13232-The above copyright notice and this permission notice shall be included in all
13233-copies or substantial portions of the Software.
13234-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
13235-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
13236-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
13237-AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
13238-LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
13239-OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
13240-SOFTWARE.
13241-------------------------------------------------------------------------------
13242-ALTERNATIVE B - Public Domain (www.unlicense.org)
13243-This is free and unencumbered software released into the public domain.
13244-Anyone is free to copy, modify, publish, use, compile, sell, or distribute this
13245-software, either in source code form or as a compiled binary, for any purpose,
13246-commercial or non-commercial, and by any means.
13247-In jurisdictions that recognize copyright laws, the author or authors of this
13248-software dedicate any and all copyright interest in the software to the public
13249-domain. We make this dedication for the benefit of the public at large and to
13250-the detriment of our heirs and successors. We intend this dedication to be an
13251-overt act of relinquishment in perpetuity of all present and future rights to
13252-this software under copyright law.
13253-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
13254-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
13255-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
13256-AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
13257-ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
13258-WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
13259-------------------------------------------------------------------------------
13260-*/