tuplesort.c
1 /*-------------------------------------------------------------------------
2  *
3  * tuplesort.c
4  * Generalized tuple sorting routines.
5  *
6  * This module handles sorting of heap tuples, index tuples, or single
7  * Datums (and could easily support other kinds of sortable objects,
8  * if necessary). It works efficiently for both small and large amounts
9  * of data. Small amounts are sorted in-memory using qsort(). Large
10  * amounts are sorted using temporary files and a standard external sort
11  * algorithm.
12  *
13  * See Knuth, volume 3, for more than you want to know about the external
14  * sorting algorithm. Historically, we divided the input into sorted runs
15  * using replacement selection, in the form of a priority tree implemented
16  * as a heap (essentially his Algorithm 5.2.3H), but now we only do that
17  * for the first run, and only if the run would otherwise end up being very
18  * short. We merge the runs using polyphase merge, Knuth's Algorithm
19  * 5.4.2D. The logical "tapes" used by Algorithm D are implemented by
20  * logtape.c, which avoids space wastage by recycling disk space as soon
21  * as each block is read from its "tape".
22  *
23  * We do not use Knuth's recommended data structure (Algorithm 5.4.1R) for
24  * the replacement selection, because it uses a fixed number of records
25  * in memory at all times. Since we are dealing with tuples that may vary
26  * considerably in size, we want to be able to vary the number of records
27  * kept in memory to ensure full utilization of the allowed sort memory
28  * space. So, we keep the tuples in a variable-size heap, with the next
29  * record to go out at the top of the heap. Like Algorithm 5.4.1R, each
30  * record is stored with the run number that it must go into, and we use
31  * (run number, key) as the ordering key for the heap. When the run number
32  * at the top of the heap changes, we know that no more records of the prior
33  * run are left in the heap. Note that there are in practice only ever two
34  * distinct run numbers, because since PostgreSQL 9.6, we only use
35  * replacement selection to form the first run.
36  *
37  * In PostgreSQL 9.6, a heap (based on Knuth's Algorithm H, with some small
38  * customizations) is only used with the aim of producing just one run,
39  * thereby avoiding all merging. Only the first run can use replacement
40  * selection, which is why there are now only two possible valid run
41  * numbers, and why heapification is customized to not distinguish between
42  * tuples in the second run (those will be quicksorted). We generally
43  * prefer a simple hybrid sort-merge strategy, where runs are sorted in much
44  * the same way as the entire input of an internal sort is sorted (using
45  * qsort()). The replacement_sort_tuples GUC controls the limited remaining
46  * use of replacement selection for the first run.
47  *
48  * There are several reasons to favor a hybrid sort-merge strategy.
49  * Maintaining a priority tree/heap has poor CPU cache characteristics.
50  * Furthermore, the growth in main memory sizes has greatly diminished the
51  * value of having runs that are larger than available memory, even in the
52  * case where there is partially sorted input and runs can be made far
53  * larger by using a heap. In most cases, a single-pass merge step is all
54  * that is required even when runs are no larger than available memory.
55  * Avoiding multiple merge passes was traditionally considered to be the
56  * major advantage of using replacement selection.
57  *
58  * The approximate amount of memory allowed for any one sort operation
59  * is specified in kilobytes by the caller (most pass work_mem). Initially,
60  * we absorb tuples and simply store them in an unsorted array as long as
61  * we haven't exceeded workMem. If we reach the end of the input without
62  * exceeding workMem, we sort the array using qsort() and subsequently return
63  * tuples just by scanning the tuple array sequentially. If we do exceed
64  * workMem, we begin to emit tuples into sorted runs in temporary tapes.
65  * When tuples are dumped in batch after quicksorting, we begin a new run
66  * with a new output tape (selected per Algorithm D). After the end of the
67  * input is reached, we dump out remaining tuples in memory into a final run
68  * (or two, when replacement selection is still used), then merge the runs
69  * using Algorithm D.
70  *
71  * When merging runs, we use a heap containing just the frontmost tuple from
72  * each source run; we repeatedly output the smallest tuple and replace it
73  * with the next tuple from its source tape (if any). When the heap empties,
74  * the merge is complete. The basic merge algorithm thus needs very little
75  * memory --- only M tuples for an M-way merge, and M is constrained to a
76  * small number. However, we can still make good use of our full workMem
77  * allocation by pre-reading additional blocks from each source tape. Without
78  * prereading, our access pattern to the temporary file would be very erratic;
79  * on average we'd read one block from each of M source tapes during the same
80  * time that we're writing M blocks to the output tape, so there is no
81  * sequentiality of access at all, defeating the read-ahead methods used by
82  * most Unix kernels. Worse, the output tape gets written into a very random
83  * sequence of blocks of the temp file, ensuring that things will be even
84  * worse when it comes time to read that tape. A straightforward merge pass
85  * thus ends up doing a lot of waiting for disk seeks. We can improve matters
86  * by prereading from each source tape sequentially, loading about workMem/M
87  * bytes from each tape in turn, and making the sequential blocks immediately
88  * available for reuse. This approach helps to localize both read and write
89  * accesses. The pre-reading is handled by logtape.c; we just tell it how
90  * much memory to use for the buffers.
91  *
92  * When the caller requests random access to the sort result, we form
93  * the final sorted run on a logical tape which is then "frozen", so
94  * that we can access it randomly. When the caller does not need random
95  * access, we return from tuplesort_performsort() as soon as we are down
96  * to one run per logical tape. The final merge is then performed
97  * on-the-fly as the caller repeatedly calls tuplesort_getXXX; this
98  * saves one cycle of writing all the data out to disk and reading it in.
99  *
100  * Before Postgres 8.2, we always used a seven-tape polyphase merge, on the
101  * grounds that 7 is the "sweet spot" on the tapes-to-passes curve according
102  * to Knuth's figure 70 (section 5.4.2). However, Knuth is assuming that
103  * tape drives are expensive beasts, and in particular that there will always
104  * be many more runs than tape drives. In our implementation a "tape drive"
105  * doesn't cost much more than a few Kb of memory buffers, so we can afford
106  * to have lots of them. In particular, if we can have as many tape drives
107  * as sorted runs, we can eliminate any repeated I/O at all. In the current
108  * code we determine the number of tapes M on the basis of workMem: we want
109  * workMem/M to be large enough that we read a fair amount of data each time
110  * we preread from a tape, so as to maintain the locality of access described
111  * above. Nonetheless, with large workMem we can have many tapes (but not
112  * too many -- see the comments in tuplesort_merge_order).
113  *
114  *
115  * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
116  * Portions Copyright (c) 1994, Regents of the University of California
117  *
118  * IDENTIFICATION
119  * src/backend/utils/sort/tuplesort.c
120  *
121  *-------------------------------------------------------------------------
122  */
123 
124 #include "postgres.h"
125 
126 #include <limits.h>
127 
128 #include "access/htup_details.h"
129 #include "access/nbtree.h"
130 #include "catalog/index.h"
131 #include "catalog/pg_am.h"
132 #include "commands/tablespace.h"
133 #include "executor/executor.h"
134 #include "miscadmin.h"
135 #include "pg_trace.h"
136 #include "utils/datum.h"
137 #include "utils/logtape.h"
138 #include "utils/lsyscache.h"
139 #include "utils/memutils.h"
140 #include "utils/pg_rusage.h"
141 #include "utils/rel.h"
142 #include "utils/sortsupport.h"
143 #include "utils/tuplesort.h"
144 
145 
146 /* sort-type codes for sort__start probes */
147 #define HEAP_SORT 0
148 #define INDEX_SORT 1
149 #define DATUM_SORT 2
150 #define CLUSTER_SORT 3
151 
152 /* GUC variables */
153 #ifdef TRACE_SORT
154 bool trace_sort = false;
155 #endif
156 
157 #ifdef DEBUG_BOUNDED_SORT
158 bool optimize_bounded_sort = true;
159 #endif
160 
161 
162 /*
163  * The objects we actually sort are SortTuple structs. These contain
164  * a pointer to the tuple proper (might be a MinimalTuple or IndexTuple),
165  * which is a separate palloc chunk --- we assume it is just one chunk and
166  * can be freed by a simple pfree() (except during merge, when we use a
167  * simple slab allocator). SortTuples also contain the tuple's first key
168  * column in Datum/nullflag format, and an index integer.
169  *
170  * Storing the first key column lets us save heap_getattr or index_getattr
171  * calls during tuple comparisons. We could extract and save all the key
172  * columns not just the first, but this would increase code complexity and
173  * overhead, and wouldn't actually save any comparison cycles in the common
174  * case where the first key determines the comparison result. Note that
175  * for a pass-by-reference datatype, datum1 points into the "tuple" storage.
176  *
177  * There is one special case: when the sort support infrastructure provides an
178  * "abbreviated key" representation, where the key is (typically) a pass by
179  * value proxy for a pass by reference type. In this case, the abbreviated key
180  * is stored in datum1 in place of the actual first key column.
181  *
182  * When sorting single Datums, the data value is represented directly by
183  * datum1/isnull1 for pass by value types (or null values). If the datatype is
184  * pass-by-reference and isnull1 is false, then "tuple" points to a separately
185  * palloc'd data value, otherwise "tuple" is NULL. The value of datum1 is then
186  * either the same pointer as "tuple", or is an abbreviated key value as
187  * described above. Accordingly, "tuple" is always used in preference to
188  * datum1 as the authoritative value for pass-by-reference cases.
189  *
190  * While building initial runs, tupindex holds the tuple's run number.
191  * Historically, the run number could meaningfully distinguish many runs, but
192  * it now only distinguishes RUN_FIRST and HEAP_RUN_NEXT, since replacement
193  * selection is always abandoned after the first run; no other run number
194  * should be represented here. During merge passes, we re-use it to hold the
195  * input tape number that each tuple in the heap was read from. tupindex goes
196  * unused if the sort occurs entirely in memory.
197  */
198 typedef struct
199 {
200  void *tuple; /* the tuple itself */
201  Datum datum1; /* value of first key column */
202  bool isnull1; /* is first key column NULL? */
203  int tupindex; /* see notes above */
204 } SortTuple;
205 
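Because datum1 caches the leading key, the comparetup routines declared later in this file can usually decide a comparison without touching the tuple proper. A minimal sketch of that pattern, assuming a single SortSupport key (the helper name below is illustrative and not part of the file):

static int
compare_leading_key_sketch(const SortTuple *a, const SortTuple *b,
						   SortSupport sortKey)
{
	int			compare;

	/* The cached first key (possibly an abbreviated key) decides most cases */
	compare = ApplySortComparator(a->datum1, a->isnull1,
								  b->datum1, b->isnull1,
								  sortKey);
	if (compare != 0)
		return compare;

	/*
	 * Tie on the leading key: a real comparetup routine would now deform the
	 * tuples (heap_getattr/index_getattr) and compare the remaining sort
	 * columns, or apply the full comparator when an abbreviated key ties.
	 */
	return 0;
}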
206 /*
207  * During merge, we use a pre-allocated set of fixed-size slots to hold
208  * tuples, to avoid palloc/pfree overhead.
209  *
210  * Merge doesn't require a lot of memory, so we can afford to waste some,
211  * by using gratuitously-sized slots. If a tuple is larger than 1 kB, the
212  * palloc() overhead is not significant anymore.
213  *
214  * 'nextfree' is valid when this chunk is in the free list. When in use, the
215  * slot holds a tuple.
216  */
217 #define SLAB_SLOT_SIZE 1024
218 
219 typedef union SlabSlot
220 {
221  union SlabSlot *nextfree;
222  char buffer[SLAB_SLOT_SIZE];
223 } SlabSlot;
224 
225 /*
226  * Possible states of a Tuplesort object. These denote the states that
227  * persist between calls of Tuplesort routines.
228  */
229 typedef enum
230 {
231  TSS_INITIAL, /* Loading tuples; still within memory limit */
232  TSS_BOUNDED, /* Loading tuples into bounded-size heap */
233  TSS_BUILDRUNS, /* Loading tuples; writing to tape */
234  TSS_SORTEDINMEM, /* Sort completed entirely in memory */
235  TSS_SORTEDONTAPE, /* Sort completed, final run is on tape */
236  TSS_FINALMERGE /* Performing final merge on-the-fly */
237 } TupSortStatus;
238 
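/*
 * Typical progressions between these states (summarizing the flow described
 * in the file header comment):
 *
 *   fits in memory:     TSS_INITIAL -> TSS_SORTEDINMEM
 *   bounded (top-N):    TSS_INITIAL -> TSS_BOUNDED -> TSS_SORTEDINMEM
 *   external sort:      TSS_INITIAL -> TSS_BUILDRUNS -> TSS_SORTEDONTAPE
 *                       (or -> TSS_FINALMERGE when the final merge is
 *                       performed on the fly for the caller)
 */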
239 /*
240  * Parameters for calculation of number of tapes to use --- see inittapes()
241  * and tuplesort_merge_order().
242  *
243  * In this calculation we assume that each tape will cost us about one
244  * block's worth of buffer space. This ignores the overhead of all the other data
245  * structures needed for each tape, but it's probably close enough.
246  *
247  * MERGE_BUFFER_SIZE is how much data we'd like to read from each input
248  * tape during a preread cycle (see discussion at top of file).
249  */
250 #define MINORDER 6 /* minimum merge order */
251 #define MAXORDER 500 /* maximum merge order */
252 #define TAPE_BUFFER_OVERHEAD BLCKSZ
253 #define MERGE_BUFFER_SIZE (BLCKSZ * 32)
254 
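To see how these constants interact, the merge order chosen by tuplesort_merge_order() (defined later in the file) comes out roughly as in the sketch below; this paraphrases the calculation rather than quoting it, and the function name is illustrative only.

static int
merge_order_sketch(int64 allowedMem)
{
	int			mOrder;

	/*
	 * Budget about one tape buffer (TAPE_BUFFER_OVERHEAD) per tape, plus
	 * MERGE_BUFFER_SIZE of preread workspace per input tape.
	 */
	mOrder = (int) ((allowedMem - TAPE_BUFFER_OVERHEAD) /
					(MERGE_BUFFER_SIZE + TAPE_BUFFER_OVERHEAD));

	/* Never merge fewer than MINORDER or more than MAXORDER runs at once */
	mOrder = Max(mOrder, MINORDER);
	mOrder = Min(mOrder, MAXORDER);

	return mOrder;
}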
255  /*
256  * Run numbers, used during external sort operations.
257  *
258  * HEAP_RUN_NEXT is only used for SortTuple.tupindex, never state.currentRun.
259  */
260 #define RUN_FIRST 0
261 #define HEAP_RUN_NEXT INT_MAX
262 #define RUN_SECOND 1
263 
264 typedef int (*SortTupleComparator) (const SortTuple *a, const SortTuple *b,
265  Tuplesortstate *state);
266 
267 /*
268  * Private state of a Tuplesort operation.
269  */
270 struct Tuplesortstate
271 {
272  TupSortStatus status; /* enumerated value as shown above */
273  int nKeys; /* number of columns in sort key */
274  bool randomAccess; /* did caller request random access? */
275  bool bounded; /* did caller specify a maximum number of
276  * tuples to return? */
277  bool boundUsed; /* true if we made use of a bounded heap */
278  int bound; /* if bounded, the maximum number of tuples */
279  bool tuples; /* Can SortTuple.tuple ever be set? */
280  int64 availMem; /* remaining memory available, in bytes */
281  int64 allowedMem; /* total memory allowed, in bytes */
282  int maxTapes; /* number of tapes (Knuth's T) */
283  int tapeRange; /* maxTapes-1 (Knuth's P) */
284  MemoryContext sortcontext; /* memory context holding most sort data */
285  MemoryContext tuplecontext; /* sub-context of sortcontext for tuple data */
286  LogicalTapeSet *tapeset; /* logtape.c object for tapes in a temp file */
287 
288  /*
289  * These function pointers decouple the routines that must know what kind
290  * of tuple we are sorting from the routines that don't need to know it.
291  * They are set up by the tuplesort_begin_xxx routines.
292  *
293  * Function to compare two tuples; result is per qsort() convention, ie:
294  * <0, 0, >0 according as a<b, a=b, a>b. The API must match
295  * qsort_arg_comparator.
296  */
297  SortTupleComparator comparetup;
298 
299  /*
300  * Function to copy a supplied input tuple into palloc'd space and set up
301  * its SortTuple representation (ie, set tuple/datum1/isnull1). Also,
302  * state->availMem must be decreased by the amount of space used for the
303  * tuple copy (note the SortTuple struct itself is not counted).
304  */
305  void (*copytup) (Tuplesortstate *state, SortTuple *stup, void *tup);
306 
307  /*
308  * Function to write a stored tuple onto tape. The representation of the
309  * tuple on tape need not be the same as it is in memory; requirements on
310  * the tape representation are given below. Unless the slab allocator is
311  * used, after writing the tuple, pfree() the out-of-line data (not the
312  * SortTuple struct!), and increase state->availMem by the amount of
313  * memory space thereby released.
314  */
315  void (*writetup) (Tuplesortstate *state, int tapenum,
316  SortTuple *stup);
317 
318  /*
319  * Function to read a stored tuple from tape back into memory. 'len' is
320  * the already-read length of the stored tuple. The tuple is allocated
321  * from the slab memory arena, or is palloc'd, see readtup_alloc().
322  */
323  void (*readtup) (Tuplesortstate *state, SortTuple *stup,
324  int tapenum, unsigned int len);
325 
326  /*
327  * This array holds the tuples now in sort memory. If we are in state
328  * INITIAL, the tuples are in no particular order; if we are in state
329  * SORTEDINMEM, the tuples are in final sorted order; in states BUILDRUNS
330  * and FINALMERGE, the tuples are organized in "heap" order per Algorithm
331  * H. In state SORTEDONTAPE, the array is not used.
332  */
333  SortTuple *memtuples; /* array of SortTuple structs */
334  int memtupcount; /* number of tuples currently present */
335  int memtupsize; /* allocated length of memtuples array */
336  bool growmemtuples; /* memtuples' growth still underway? */
337 
338  /*
339  * Memory for tuples is sometimes allocated using a simple slab allocator,
340  * rather than with palloc(). Currently, we switch to slab allocation
341  * when we start merging. Merging only needs to keep a small, fixed
342  * number of tuples in memory at any time, so we can avoid the
343  * palloc/pfree overhead by recycling a fixed number of fixed-size slots
344  * to hold the tuples.
345  *
346  * For the slab, we use one large allocation, divided into SLAB_SLOT_SIZE
347  * slots. The allocation is sized to have one slot per tape, plus one
348  * additional slot. We need that many slots to hold all the tuples kept
349  * in the heap during merge, plus the one we have last returned from the
350  * sort, with tuplesort_gettuple.
351  *
352  * Initially, all the slots are kept in a linked list of free slots. When
353  * a tuple is read from a tape, it is put into the next available slot, if
354  * it fits. If the tuple is larger than SLAB_SLOT_SIZE, it is palloc'd
355  * instead.
356  *
357  * When we're done processing a tuple, we return the slot back to the free
358  * list, or pfree() if it was palloc'd. We know that a tuple was
359  * allocated from the slab, if its pointer value is between
360  * slabMemoryBegin and -End.
361  *
362  * When the slab allocator is used, the USEMEM/LACKMEM mechanism of
363  * tracking memory usage is not used.
364  */
365  bool slabAllocatorUsed;
366 
367  char *slabMemoryBegin; /* beginning of slab memory arena */
368  char *slabMemoryEnd; /* end of slab memory arena */
369  SlabSlot *slabFreeHead; /* head of free list */
370 
371  /* Buffer size to use for reading input tapes, during merge. */
372  size_t read_buffer_size;
373 
374  /*
375  * When we return a tuple to the caller in tuplesort_gettuple_XXX, that
376  * came from a tape (that is, in TSS_SORTEDONTAPE or TSS_FINALMERGE
377  * modes), we remember the tuple in 'lastReturnedTuple', so that we can
378  * recycle the memory on next gettuple call.
379  */
380  void *lastReturnedTuple;
381 
382  /*
383  * While building initial runs, this indicates if the replacement
384  * selection strategy is in use. When it isn't, then a simple hybrid
385  * sort-merge strategy is in use instead (runs are quicksorted).
386  */
387  bool replaceActive;
388 
389  /*
390  * While building initial runs, this is the current output run number
391  * (starting at RUN_FIRST). Afterwards, it is the number of initial runs
392  * we made.
393  */
394  int currentRun;
395 
396  /*
397  * Unless otherwise noted, all pointer variables below are pointers to
398  * arrays of length maxTapes, holding per-tape data.
399  */
400 
401  /*
402  * This variable is only used during merge passes. mergeactive[i] is true
403  * if we are reading an input run from (actual) tape number i and have not
404  * yet exhausted that run.
405  */
406  bool *mergeactive; /* active input run source? */
407 
408  /*
409  * Variables for Algorithm D. Note that destTape is a "logical" tape
410  * number, ie, an index into the tp_xxx[] arrays. Be careful to keep
411  * "logical" and "actual" tape numbers straight!
412  */
413  int Level; /* Knuth's l */
414  int destTape; /* current output tape (Knuth's j, less 1) */
415  int *tp_fib; /* Target Fibonacci run counts (A[]) */
416  int *tp_runs; /* # of real runs on each tape */
417  int *tp_dummy; /* # of dummy runs for each tape (D[]) */
418  int *tp_tapenum; /* Actual tape numbers (TAPE[]) */
419  int activeTapes; /* # of active input tapes in merge pass */
420 
421  /*
422  * These variables are used after completion of sorting to keep track of
423  * the next tuple to return. (In the tape case, the tape's current read
424  * position is also critical state.)
425  */
426  int result_tape; /* actual tape number of finished output */
427  int current; /* array index (only used if SORTEDINMEM) */
428  bool eof_reached; /* reached EOF (needed for cursors) */
429 
430  /* markpos_xxx holds marked position for mark and restore */
431  long markpos_block; /* tape block# (only used if SORTEDONTAPE) */
432  int markpos_offset; /* saved "current", or offset in tape block */
433  bool markpos_eof; /* saved "eof_reached" */
434 
435  /*
436  * The sortKeys variable is used by every case other than the hash index
437  * case; it is set by tuplesort_begin_xxx. tupDesc is only used by the
438  * MinimalTuple and CLUSTER routines, though.
439  */
440  TupleDesc tupDesc;
441  SortSupport sortKeys; /* array of length nKeys */
442 
443  /*
444  * This variable is shared by the single-key MinimalTuple case and the
445  * Datum case (which both use qsort_ssup()). Otherwise it's NULL.
446  */
447  SortSupport onlyKey;
448 
449  /*
450  * Additional state for managing "abbreviated key" sortsupport routines
451  * (which currently may be used by all cases except the hash index case).
452  * Tracks the intervals at which the optimization's effectiveness is
453  * tested.
454  */
455  int64 abbrevNext; /* Tuple # at which to next check
456  * applicability */
457 
458  /*
459  * These variables are specific to the CLUSTER case; they are set by
460  * tuplesort_begin_cluster.
461  */
462  IndexInfo *indexInfo; /* info about index being used for reference */
463  EState *estate; /* for evaluating index expressions */
464 
465  /*
466  * These variables are specific to the IndexTuple case; they are set by
467  * tuplesort_begin_index_xxx and used only by the IndexTuple routines.
468  */
469  Relation heapRel; /* table the index is being built on */
470  Relation indexRel; /* index being built */
471 
472  /* These are specific to the index_btree subcase: */
473  bool enforceUnique; /* complain if we find duplicate tuples */
474 
475  /* These are specific to the index_hash subcase: */
476  uint32 hash_mask; /* mask for sortable part of hash code */
477 
478  /*
479  * These variables are specific to the Datum case; they are set by
480  * tuplesort_begin_datum and used only by the DatumTuple routines.
481  */
482  Oid datumType;
483  /* we need typelen in order to know how to copy the Datums. */
484  int datumTypeLen;
485 
486  /*
487  * Resource snapshot for time of sort start.
488  */
489 #ifdef TRACE_SORT
490  PGRUsage ru_start;
491 #endif
492 };
493 
494 /*
495  * Is the given tuple allocated from the slab memory arena?
496  */
497 #define IS_SLAB_SLOT(state, tuple) \
498  ((char *) (tuple) >= (state)->slabMemoryBegin && \
499  (char *) (tuple) < (state)->slabMemoryEnd)
500 
501 /*
502  * Return the given tuple to the slab memory free list, or free it
503  * if it was palloc'd.
504  */
505 #define RELEASE_SLAB_SLOT(state, tuple) \
506  do { \
507  SlabSlot *buf = (SlabSlot *) tuple; \
508  \
509  if (IS_SLAB_SLOT((state), buf)) \
510  { \
511  buf->nextfree = (state)->slabFreeHead; \
512  (state)->slabFreeHead = buf; \
513  } else \
514  pfree(buf); \
515  } while(0)
516 
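The complementary operation, taking a slot off the free list when a tuple is read back from tape, is a simple pop; the sketch below is modeled on readtup_alloc() (declared further down), with oversized tuples falling back to an ordinary allocation. The function name is illustrative only.

static void *
slab_alloc_sketch(Tuplesortstate *state, Size tuplen)
{
	SlabSlot   *buf;

	/* Tuples that don't fit in a slot bypass the slab entirely */
	if (tuplen > SLAB_SLOT_SIZE || state->slabFreeHead == NULL)
		return MemoryContextAlloc(state->sortcontext, tuplen);

	/* Pop the head of the free list and reuse the fixed-size slot */
	buf = state->slabFreeHead;
	state->slabFreeHead = buf->nextfree;
	return buf;
}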
517 #define COMPARETUP(state,a,b) ((*(state)->comparetup) (a, b, state))
518 #define COPYTUP(state,stup,tup) ((*(state)->copytup) (state, stup, tup))
519 #define WRITETUP(state,tape,stup) ((*(state)->writetup) (state, tape, stup))
520 #define READTUP(state,stup,tape,len) ((*(state)->readtup) (state, stup, tape, len))
521 #define LACKMEM(state) ((state)->availMem < 0 && !(state)->slabAllocatorUsed)
522 #define USEMEM(state,amt) ((state)->availMem -= (amt))
523 #define FREEMEM(state,amt) ((state)->availMem += (amt))
524 
525 /*
526  * NOTES about on-tape representation of tuples:
527  *
528  * We require the first "unsigned int" of a stored tuple to be the total size
529  * on-tape of the tuple, including itself (so it is never zero; an all-zero
530  * unsigned int is used to delimit runs). The remainder of the stored tuple
531  * may or may not match the in-memory representation of the tuple ---
532  * any conversion needed is the job of the writetup and readtup routines.
533  *
534  * If state->randomAccess is true, then the stored representation of the
535  * tuple must be followed by another "unsigned int" that is a copy of the
536  * length --- so the total tape space used is actually sizeof(unsigned int)
537  * more than the stored length value. This allows read-backwards. When
538  * randomAccess is not true, the write/read routines may omit the extra
539  * length word.
540  *
541  * writetup is expected to write both length words as well as the tuple
542  * data. When readtup is called, the tape is positioned just after the
543  * front length word; readtup must read the tuple data and advance past
544  * the back length word (if present).
545  *
546  * The write/read routines can make use of the tuple description data
547  * stored in the Tuplesortstate record, if needed. They are also expected
548  * to adjust state->availMem by the amount of memory space (not tape space!)
549  * released or consumed. There is no error return from either writetup
550  * or readtup; they should ereport() on failure.
551  *
552  *
553  * NOTES about memory consumption calculations:
554  *
555  * We count space allocated for tuples against the workMem limit, plus
556  * the space used by the variable-size memtuples array. Fixed-size space
557  * is not counted; it's small enough to not be interesting.
558  *
559  * Note that we count actual space used (as shown by GetMemoryChunkSpace)
560  * rather than the originally-requested size. This is important since
561  * palloc can add substantial overhead. It's not a complete answer since
562  * we won't count any wasted space in palloc allocation blocks, but it's
563  * a lot better than what we were doing before 7.3. As of 9.6, a
564  * separate memory context is used for caller passed tuples. Resetting
565  * it at certain key increments significantly ameliorates fragmentation.
566  * Note that this places a responsibility on readtup and copytup routines
567  * to use the right memory context for these tuples (and to not use the
568  * reset context for anything whose lifetime needs to span multiple
569  * external sort runs).
570  */
571 
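A writetup routine that follows these conventions looks roughly like the sketch below (simplified from the writetup_* routines later in this file; LogicalTapeWrite() is logtape.c's append primitive, and the payload layout is up to each tuple kind). The name and parameter list are illustrative only.

static void
writetup_sketch(Tuplesortstate *state, int tapenum,
				void *data, unsigned int datalen)
{
	/* the leading length word counts itself, so it can never be zero */
	unsigned int tuplen = datalen + sizeof(unsigned int);

	LogicalTapeWrite(state->tapeset, tapenum, &tuplen, sizeof(tuplen));
	LogicalTapeWrite(state->tapeset, tapenum, data, datalen);

	/* a trailing copy of the length permits reading the tape backwards */
	if (state->randomAccess)
		LogicalTapeWrite(state->tapeset, tapenum, &tuplen, sizeof(tuplen));
}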
572 /* When using this macro, beware of double evaluation of len */
573 #define LogicalTapeReadExact(tapeset, tapenum, ptr, len) \
574  do { \
575  if (LogicalTapeRead(tapeset, tapenum, ptr, len) != (size_t) (len)) \
576  elog(ERROR, "unexpected end of data"); \
577  } while(0)
578 
579 
580 static Tuplesortstate *tuplesort_begin_common(int workMem, bool randomAccess);
581 static void puttuple_common(Tuplesortstate *state, SortTuple *tuple);
582 static bool consider_abort_common(Tuplesortstate *state);
583 static bool useselection(Tuplesortstate *state);
584 static void inittapes(Tuplesortstate *state);
585 static void selectnewtape(Tuplesortstate *state);
586 static void init_slab_allocator(Tuplesortstate *state, int numSlots);
587 static void mergeruns(Tuplesortstate *state);
588 static void mergeonerun(Tuplesortstate *state);
589 static void beginmerge(Tuplesortstate *state);
590 static bool mergereadnext(Tuplesortstate *state, int srcTape, SortTuple *stup);
591 static void dumptuples(Tuplesortstate *state, bool alltuples);
592 static void dumpbatch(Tuplesortstate *state, bool alltuples);
593 static void make_bounded_heap(Tuplesortstate *state);
594 static void sort_bounded_heap(Tuplesortstate *state);
595 static void tuplesort_sort_memtuples(Tuplesortstate *state);
596 static void tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple,
597  bool checkIndex);
598 static void tuplesort_heap_replace_top(Tuplesortstate *state, SortTuple *tuple,
599  bool checkIndex);
600 static void tuplesort_heap_delete_top(Tuplesortstate *state, bool checkIndex);
601 static void reversedirection(Tuplesortstate *state);
602 static unsigned int getlen(Tuplesortstate *state, int tapenum, bool eofOK);
603 static void markrunend(Tuplesortstate *state, int tapenum);
604 static void *readtup_alloc(Tuplesortstate *state, Size tuplen);
605 static int comparetup_heap(const SortTuple *a, const SortTuple *b,
606  Tuplesortstate *state);
607 static void copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup);
608 static void writetup_heap(Tuplesortstate *state, int tapenum,
609  SortTuple *stup);
610 static void readtup_heap(Tuplesortstate *state, SortTuple *stup,
611  int tapenum, unsigned int len);
612 static int comparetup_cluster(const SortTuple *a, const SortTuple *b,
613  Tuplesortstate *state);
614 static void copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup);
615 static void writetup_cluster(Tuplesortstate *state, int tapenum,
616  SortTuple *stup);
617 static void readtup_cluster(Tuplesortstate *state, SortTuple *stup,
618  int tapenum, unsigned int len);
619 static int comparetup_index_btree(const SortTuple *a, const SortTuple *b,
620  Tuplesortstate *state);
621 static int comparetup_index_hash(const SortTuple *a, const SortTuple *b,
622  Tuplesortstate *state);
623 static void copytup_index(Tuplesortstate *state, SortTuple *stup, void *tup);
624 static void writetup_index(Tuplesortstate *state, int tapenum,
625  SortTuple *stup);
626 static void readtup_index(Tuplesortstate *state, SortTuple *stup,
627  int tapenum, unsigned int len);
628 static int comparetup_datum(const SortTuple *a, const SortTuple *b,
629  Tuplesortstate *state);
630 static void copytup_datum(Tuplesortstate *state, SortTuple *stup, void *tup);
631 static void writetup_datum(Tuplesortstate *state, int tapenum,
632  SortTuple *stup);
633 static void readtup_datum(Tuplesortstate *state, SortTuple *stup,
634  int tapenum, unsigned int len);
635 static void free_sort_tuple(Tuplesortstate *state, SortTuple *stup);
636 
637 /*
638  * Special versions of qsort just for SortTuple objects. qsort_tuple() sorts
639  * any variant of SortTuples, using the appropriate comparetup function.
640  * qsort_ssup() is specialized for the case where the comparetup function
641  * reduces to ApplySortComparator(), that is single-key MinimalTuple sorts
642  * and Datum sorts.
643  */
644 #include "qsort_tuple.c"
645 
646 
647 /*
648  * tuplesort_begin_xxx
649  *
650  * Initialize for a tuple sort operation.
651  *
652  * After calling tuplesort_begin, the caller should call tuplesort_putXXX
653  * zero or more times, then call tuplesort_performsort when all the tuples
654  * have been supplied. After performsort, retrieve the tuples in sorted
655  * order by calling tuplesort_getXXX until it returns false/NULL. (If random
656  * access was requested, rescan, markpos, and restorepos can also be called.)
657  * Call tuplesort_end to terminate the operation and release memory/disk space.
658  *
659  * Each variant of tuplesort_begin has a workMem parameter specifying the
660  * maximum number of kilobytes of RAM to use before spilling data to disk.
661  * (The normal value of this parameter is work_mem, but some callers use
662  * other values.) Each variant also has a randomAccess parameter specifying
663  * whether the caller needs non-sequential access to the sort result.
664  */
665 
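A caller following that protocol looks roughly like the sketch below for the heap-tuple case; tupDesc, the key arrays, slot, and the fetch helper are assumed to be supplied by the caller and are illustrative only.

Tuplesortstate *sortstate;

sortstate = tuplesort_begin_heap(tupDesc, nkeys, attNums,
								 sortOperators, sortCollations,
								 nullsFirstFlags,
								 work_mem, false);	/* no random access */

/* feed the input; each tuple is copied into the sort's own memory */
while (fetch_next_slot(slot))			/* hypothetical input loop */
	tuplesort_puttupleslot(sortstate, slot);

tuplesort_performsort(sortstate);

/* drain the output in sorted order via tuplesort_gettupleslot(), then: */
tuplesort_end(sortstate);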
666 static Tuplesortstate *
667 tuplesort_begin_common(int workMem, bool randomAccess)
668 {
669  Tuplesortstate *state;
670  MemoryContext sortcontext;
671  MemoryContext tuplecontext;
672  MemoryContext oldcontext;
673 
674  /*
675  * Create a working memory context for this sort operation. All data
676  * needed by the sort will live inside this context.
677  */
679  "TupleSort main",
681 
682  /*
683  * Caller tuple (e.g. IndexTuple) memory context.
684  *
685  * A dedicated child context used exclusively for caller passed tuples
686  * eases memory management. Resetting at key points reduces
687  * fragmentation. Note that the memtuples array of SortTuples is allocated
688  * in the parent context, not this context, because there is no need to
689  * free memtuples early.
690  */
691  tuplecontext = AllocSetContextCreate(sortcontext,
692  "Caller tuples",
693  ALLOCSET_DEFAULT_SIZES);
694 
695  /*
696  * Make the Tuplesortstate within the per-sort context. This way, we
697  * don't need a separate pfree() operation for it at shutdown.
698  */
699  oldcontext = MemoryContextSwitchTo(sortcontext);
700 
701  state = (Tuplesortstate *) palloc0(sizeof(Tuplesortstate));
702 
703 #ifdef TRACE_SORT
704  if (trace_sort)
705  pg_rusage_init(&state->ru_start);
706 #endif
707 
708  state->status = TSS_INITIAL;
709  state->randomAccess = randomAccess;
710  state->bounded = false;
711  state->tuples = true;
712  state->boundUsed = false;
713  state->allowedMem = workMem * (int64) 1024;
714  state->availMem = state->allowedMem;
715  state->sortcontext = sortcontext;
716  state->tuplecontext = tuplecontext;
717  state->tapeset = NULL;
718 
719  state->memtupcount = 0;
720 
721  /*
722  * Initial size of array must be more than ALLOCSET_SEPARATE_THRESHOLD;
723  * see comments in grow_memtuples().
724  */
725  state->memtupsize = Max(1024,
726  ALLOCSET_SEPARATE_THRESHOLD / sizeof(SortTuple) + 1);
727 
728  state->growmemtuples = true;
729  state->slabAllocatorUsed = false;
730  state->memtuples = (SortTuple *) palloc(state->memtupsize * sizeof(SortTuple));
731 
732  USEMEM(state, GetMemoryChunkSpace(state->memtuples));
733 
734  /* workMem must be large enough for the minimal memtuples array */
735  if (LACKMEM(state))
736  elog(ERROR, "insufficient memory allowed for sort");
737 
738  state->currentRun = RUN_FIRST;
739 
740  /*
741  * maxTapes, tapeRange, and Algorithm D variables will be initialized by
742  * inittapes(), if needed
743  */
744 
745  state->result_tape = -1; /* flag that result tape has not been formed */
746 
747  MemoryContextSwitchTo(oldcontext);
748 
749  return state;
750 }
751 
752 Tuplesortstate *
753 tuplesort_begin_heap(TupleDesc tupDesc,
754  int nkeys, AttrNumber *attNums,
755  Oid *sortOperators, Oid *sortCollations,
756  bool *nullsFirstFlags,
757  int workMem, bool randomAccess)
758 {
759  Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
760  MemoryContext oldcontext;
761  int i;
762 
763  oldcontext = MemoryContextSwitchTo(state->sortcontext);
764 
765  AssertArg(nkeys > 0);
766 
767 #ifdef TRACE_SORT
768  if (trace_sort)
769  elog(LOG,
770  "begin tuple sort: nkeys = %d, workMem = %d, randomAccess = %c",
771  nkeys, workMem, randomAccess ? 't' : 'f');
772 #endif
773 
774  state->nKeys = nkeys;
775 
776  TRACE_POSTGRESQL_SORT_START(HEAP_SORT,
777  false, /* no unique check */
778  nkeys,
779  workMem,
780  randomAccess);
781 
782  state->comparetup = comparetup_heap;
783  state->copytup = copytup_heap;
784  state->writetup = writetup_heap;
785  state->readtup = readtup_heap;
786 
787  state->tupDesc = tupDesc; /* assume we need not copy tupDesc */
788  state->abbrevNext = 10;
789 
790  /* Prepare SortSupport data for each column */
791  state->sortKeys = (SortSupport) palloc0(nkeys * sizeof(SortSupportData));
792 
793  for (i = 0; i < nkeys; i++)
794  {
795  SortSupport sortKey = state->sortKeys + i;
796 
797  AssertArg(attNums[i] != 0);
798  AssertArg(sortOperators[i] != 0);
799 
800  sortKey->ssup_cxt = CurrentMemoryContext;
801  sortKey->ssup_collation = sortCollations[i];
802  sortKey->ssup_nulls_first = nullsFirstFlags[i];
803  sortKey->ssup_attno = attNums[i];
804  /* Convey if abbreviation optimization is applicable in principle */
805  sortKey->abbreviate = (i == 0);
806 
807  PrepareSortSupportFromOrderingOp(sortOperators[i], sortKey);
808  }
809 
810  /*
811  * The "onlyKey" optimization cannot be used with abbreviated keys, since
812  * tie-breaker comparisons may be required. Typically, the optimization
813  * is only of value to pass-by-value types anyway, whereas abbreviated
814  * keys are typically only of value to pass-by-reference types.
815  */
816  if (nkeys == 1 && !state->sortKeys->abbrev_converter)
817  state->onlyKey = state->sortKeys;
818 
819  MemoryContextSwitchTo(oldcontext);
820 
821  return state;
822 }
823 
824 Tuplesortstate *
825 tuplesort_begin_cluster(TupleDesc tupDesc,
826  Relation indexRel,
827  int workMem, bool randomAccess)
828 {
829  Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
830  ScanKey indexScanKey;
831  MemoryContext oldcontext;
832  int i;
833 
834  Assert(indexRel->rd_rel->relam == BTREE_AM_OID);
835 
836  oldcontext = MemoryContextSwitchTo(state->sortcontext);
837 
838 #ifdef TRACE_SORT
839  if (trace_sort)
840  elog(LOG,
841  "begin tuple sort: nkeys = %d, workMem = %d, randomAccess = %c",
842  RelationGetNumberOfAttributes(indexRel),
843  workMem, randomAccess ? 't' : 'f');
844 #endif
845 
846  state->nKeys = RelationGetNumberOfAttributes(indexRel);
847 
848  TRACE_POSTGRESQL_SORT_START(CLUSTER_SORT,
849  false, /* no unique check */
850  state->nKeys,
851  workMem,
852  randomAccess);
853 
854  state->comparetup = comparetup_cluster;
855  state->copytup = copytup_cluster;
856  state->writetup = writetup_cluster;
857  state->readtup = readtup_cluster;
858  state->abbrevNext = 10;
859 
860  state->indexInfo = BuildIndexInfo(indexRel);
861 
862  state->tupDesc = tupDesc; /* assume we need not copy tupDesc */
863 
864  indexScanKey = _bt_mkscankey_nodata(indexRel);
865 
866  if (state->indexInfo->ii_Expressions != NULL)
867  {
868  TupleTableSlot *slot;
869  ExprContext *econtext;
870 
871  /*
872  * We will need to use FormIndexDatum to evaluate the index
873  * expressions. To do that, we need an EState, as well as a
874  * TupleTableSlot to put the table tuples into. The econtext's
875  * scantuple has to point to that slot, too.
876  */
877  state->estate = CreateExecutorState();
878  slot = MakeSingleTupleTableSlot(tupDesc);
879  econtext = GetPerTupleExprContext(state->estate);
880  econtext->ecxt_scantuple = slot;
881  }
882 
883  /* Prepare SortSupport data for each column */
884  state->sortKeys = (SortSupport) palloc0(state->nKeys *
885  sizeof(SortSupportData));
886 
887  for (i = 0; i < state->nKeys; i++)
888  {
889  SortSupport sortKey = state->sortKeys + i;
890  ScanKey scanKey = indexScanKey + i;
891  int16 strategy;
892 
893  sortKey->ssup_cxt = CurrentMemoryContext;
894  sortKey->ssup_collation = scanKey->sk_collation;
895  sortKey->ssup_nulls_first =
896  (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
897  sortKey->ssup_attno = scanKey->sk_attno;
898  /* Convey if abbreviation optimization is applicable in principle */
899  sortKey->abbreviate = (i == 0);
900 
901  AssertState(sortKey->ssup_attno != 0);
902 
903  strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
904  BTGreaterStrategyNumber : BTLessStrategyNumber;
905 
906  PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
907  }
908 
909  _bt_freeskey(indexScanKey);
910 
911  MemoryContextSwitchTo(oldcontext);
912 
913  return state;
914 }
915 
916 Tuplesortstate *
917 tuplesort_begin_index_btree(Relation heapRel,
918  Relation indexRel,
919  bool enforceUnique,
920  int workMem, bool randomAccess)
921 {
922  Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
923  ScanKey indexScanKey;
924  MemoryContext oldcontext;
925  int i;
926 
927  oldcontext = MemoryContextSwitchTo(state->sortcontext);
928 
929 #ifdef TRACE_SORT
930  if (trace_sort)
931  elog(LOG,
932  "begin index sort: unique = %c, workMem = %d, randomAccess = %c",
933  enforceUnique ? 't' : 'f',
934  workMem, randomAccess ? 't' : 'f');
935 #endif
936 
937  state->nKeys = RelationGetNumberOfAttributes(indexRel);
938 
939  TRACE_POSTGRESQL_SORT_START(INDEX_SORT,
940  enforceUnique,
941  state->nKeys,
942  workMem,
943  randomAccess);
944 
945  state->comparetup = comparetup_index_btree;
946  state->copytup = copytup_index;
947  state->writetup = writetup_index;
948  state->readtup = readtup_index;
949  state->abbrevNext = 10;
950 
951  state->heapRel = heapRel;
952  state->indexRel = indexRel;
953  state->enforceUnique = enforceUnique;
954 
955  indexScanKey = _bt_mkscankey_nodata(indexRel);
956  state->nKeys = RelationGetNumberOfAttributes(indexRel);
957 
958  /* Prepare SortSupport data for each column */
959  state->sortKeys = (SortSupport) palloc0(state->nKeys *
960  sizeof(SortSupportData));
961 
962  for (i = 0; i < state->nKeys; i++)
963  {
964  SortSupport sortKey = state->sortKeys + i;
965  ScanKey scanKey = indexScanKey + i;
966  int16 strategy;
967 
968  sortKey->ssup_cxt = CurrentMemoryContext;
969  sortKey->ssup_collation = scanKey->sk_collation;
970  sortKey->ssup_nulls_first =
971  (scanKey->sk_flags & SK_BT_NULLS_FIRST) != 0;
972  sortKey->ssup_attno = scanKey->sk_attno;
973  /* Convey if abbreviation optimization is applicable in principle */
974  sortKey->abbreviate = (i == 0);
975 
976  AssertState(sortKey->ssup_attno != 0);
977 
978  strategy = (scanKey->sk_flags & SK_BT_DESC) != 0 ?
979  BTGreaterStrategyNumber : BTLessStrategyNumber;
980 
981  PrepareSortSupportFromIndexRel(indexRel, strategy, sortKey);
982  }
983 
984  _bt_freeskey(indexScanKey);
985 
986  MemoryContextSwitchTo(oldcontext);
987 
988  return state;
989 }
990 
991 Tuplesortstate *
992 tuplesort_begin_index_hash(Relation heapRel,
993  Relation indexRel,
994  uint32 hash_mask,
995  int workMem, bool randomAccess)
996 {
997  Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
998  MemoryContext oldcontext;
999 
1000  oldcontext = MemoryContextSwitchTo(state->sortcontext);
1001 
1002 #ifdef TRACE_SORT
1003  if (trace_sort)
1004  elog(LOG,
1005  "begin index sort: hash_mask = 0x%x, workMem = %d, randomAccess = %c",
1006  hash_mask,
1007  workMem, randomAccess ? 't' : 'f');
1008 #endif
1009 
1010  state->nKeys = 1; /* Only one sort column, the hash code */
1011 
1012  state->comparetup = comparetup_index_hash;
1013  state->copytup = copytup_index;
1014  state->writetup = writetup_index;
1015  state->readtup = readtup_index;
1016 
1017  state->heapRel = heapRel;
1018  state->indexRel = indexRel;
1019 
1020  state->hash_mask = hash_mask;
1021 
1022  MemoryContextSwitchTo(oldcontext);
1023 
1024  return state;
1025 }
1026 
1027 Tuplesortstate *
1028 tuplesort_begin_datum(Oid datumType, Oid sortOperator, Oid sortCollation,
1029  bool nullsFirstFlag,
1030  int workMem, bool randomAccess)
1031 {
1032  Tuplesortstate *state = tuplesort_begin_common(workMem, randomAccess);
1033  MemoryContext oldcontext;
1034  int16 typlen;
1035  bool typbyval;
1036 
1037  oldcontext = MemoryContextSwitchTo(state->sortcontext);
1038 
1039 #ifdef TRACE_SORT
1040  if (trace_sort)
1041  elog(LOG,
1042  "begin datum sort: workMem = %d, randomAccess = %c",
1043  workMem, randomAccess ? 't' : 'f');
1044 #endif
1045 
1046  state->nKeys = 1; /* always a one-column sort */
1047 
1048  TRACE_POSTGRESQL_SORT_START(DATUM_SORT,
1049  false, /* no unique check */
1050  1,
1051  workMem,
1052  randomAccess);
1053 
1054  state->comparetup = comparetup_datum;
1055  state->copytup = copytup_datum;
1056  state->writetup = writetup_datum;
1057  state->readtup = readtup_datum;
1058  state->abbrevNext = 10;
1059 
1060  state->datumType = datumType;
1061 
1062  /* lookup necessary attributes of the datum type */
1063  get_typlenbyval(datumType, &typlen, &typbyval);
1064  state->datumTypeLen = typlen;
1065  state->tuples = !typbyval;
1066 
1067  /* Prepare SortSupport data */
1068  state->sortKeys = (SortSupport) palloc0(sizeof(SortSupportData));
1069 
1070  state->sortKeys->ssup_cxt = CurrentMemoryContext;
1071  state->sortKeys->ssup_collation = sortCollation;
1072  state->sortKeys->ssup_nulls_first = nullsFirstFlag;
1073 
1074  /*
1075  * Abbreviation is possible here only for by-reference types. In theory,
1076  * a pass-by-value datatype could have an abbreviated form that is cheaper
1077  * to compare. In a tuple sort, we could support that, because we can
1078  * always extract the original datum from the tuple if needed. Here, we
1079  * can't, because a datum sort only stores a single copy of the datum; the
1080  * "tuple" field of each sortTuple is NULL.
1081  */
1082  state->sortKeys->abbreviate = !typbyval;
1083 
1084  PrepareSortSupportFromOrderingOp(sortOperator, state->sortKeys);
1085 
1086  /*
1087  * The "onlyKey" optimization cannot be used with abbreviated keys, since
1088  * tie-breaker comparisons may be required. Typically, the optimization
1089  * is only of value to pass-by-value types anyway, whereas abbreviated
1090  * keys are typically only of value to pass-by-reference types.
1091  */
1092  if (!state->sortKeys->abbrev_converter)
1093  state->onlyKey = state->sortKeys;
1094 
1095  MemoryContextSwitchTo(oldcontext);
1096 
1097  return state;
1098 }
1099 
1100 /*
1101  * tuplesort_set_bound
1102  *
1103  * Advise tuplesort that at most the first N result tuples are required.
1104  *
1105  * Must be called before inserting any tuples. (Actually, we could allow it
1106  * as long as the sort hasn't spilled to disk, but there seems no need for
1107  * delayed calls at the moment.)
1108  *
1109  * This is a hint only. The tuplesort may still return more tuples than
1110  * requested.
1111  */
1112 void
1113 tuplesort_set_bound(Tuplesortstate *state, int64 bound)
1114 {
1115  /* Assert we're called before loading any tuples */
1116  Assert(state->status == TSS_INITIAL);
1117  Assert(state->memtupcount == 0);
1118  Assert(!state->bounded);
1119 
1120 #ifdef DEBUG_BOUNDED_SORT
1121  /* Honor GUC setting that disables the feature (for easy testing) */
1122  if (!optimize_bounded_sort)
1123  return;
1124 #endif
1125 
1126  /* We want to be able to compute bound * 2, so limit the setting */
1127  if (bound > (int64) (INT_MAX / 2))
1128  return;
1129 
1130  state->bounded = true;
1131  state->bound = (int) bound;
1132 
1133  /*
1134  * Bounded sorts are not an effective target for abbreviated key
1135  * optimization. Disable by setting state to be consistent with no
1136  * abbreviation support.
1137  */
1138  state->sortKeys->abbrev_converter = NULL;
1139  if (state->sortKeys->abbrev_full_comparator)
1140  state->sortKeys->comparator = state->sortKeys->abbrev_full_comparator;
1141 
1142  /* Not strictly necessary, but be tidy */
1143  state->sortKeys->abbrev_abort = NULL;
1144  state->sortKeys->abbrev_full_comparator = NULL;
1145 }
1146 
1147 /*
1148  * tuplesort_end
1149  *
1150  * Release resources and clean up.
1151  *
1152  * NOTE: after calling this, any pointers returned by tuplesort_getXXX are
1153  * pointing to garbage. Be careful not to attempt to use or free such
1154  * pointers afterwards!
1155  */
1156 void
1157 tuplesort_end(Tuplesortstate *state)
1158 {
1159  /* context swap probably not needed, but let's be safe */
1160  MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
1161 
1162 #ifdef TRACE_SORT
1163  long spaceUsed;
1164 
1165  if (state->tapeset)
1166  spaceUsed = LogicalTapeSetBlocks(state->tapeset);
1167  else
1168  spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
1169 #endif
1170 
1171  /*
1172  * Delete temporary "tape" files, if any.
1173  *
1174  * Note: want to include this in reported total cost of sort, hence need
1175  * for two #ifdef TRACE_SORT sections.
1176  */
1177  if (state->tapeset)
1178  LogicalTapeSetClose(state->tapeset);
1179 
1180 #ifdef TRACE_SORT
1181  if (trace_sort)
1182  {
1183  if (state->tapeset)
1184  elog(LOG, "external sort ended, %ld disk blocks used: %s",
1185  spaceUsed, pg_rusage_show(&state->ru_start));
1186  else
1187  elog(LOG, "internal sort ended, %ld KB used: %s",
1188  spaceUsed, pg_rusage_show(&state->ru_start));
1189  }
1190 
1191  TRACE_POSTGRESQL_SORT_DONE(state->tapeset != NULL, spaceUsed);
1192 #else
1193 
1194  /*
1195  * If you disabled TRACE_SORT, you can still probe sort__done, but you
1196  * ain't getting space-used stats.
1197  */
1198  TRACE_POSTGRESQL_SORT_DONE(state->tapeset != NULL, 0L);
1199 #endif
1200 
1201  /* Free any execution state created for CLUSTER case */
1202  if (state->estate != NULL)
1203  {
1204  ExprContext *econtext = GetPerTupleExprContext(state->estate);
1205 
1206  ExecDropSingleTupleTableSlot(econtext->ecxt_scantuple);
1207  FreeExecutorState(state->estate);
1208  }
1209 
1210  MemoryContextSwitchTo(oldcontext);
1211 
1212  /*
1213  * Free the per-sort memory context, thereby releasing all working memory,
1214  * including the Tuplesortstate struct itself.
1215  */
1216  MemoryContextDelete(state->sortcontext);
1217 }
1218 
1219 /*
1220  * Grow the memtuples[] array, if possible within our memory constraint. We
1221  * must not exceed INT_MAX tuples in memory or the caller-provided memory
1222  * limit. Return TRUE if we were able to enlarge the array, FALSE if not.
1223  *
1224  * Normally, at each increment we double the size of the array. When doing
1225  * that would exceed a limit, we attempt one last, smaller increase (and then
1226  * clear the growmemtuples flag so we don't try any more). That allows us to
1227  * use memory as fully as permitted; sticking to the pure doubling rule could
1228  * result in almost half going unused. Because availMem moves around with
1229  * tuple addition/removal, we need some rule to prevent making repeated small
1230  * increases in memtupsize, which would just be useless thrashing. The
1231  * growmemtuples flag accomplishes that and also prevents useless
1232  * recalculations in this function.
1233  */
1234 static bool
1235 grow_memtuples(Tuplesortstate *state)
1236 {
1237  int newmemtupsize;
1238  int memtupsize = state->memtupsize;
1239  int64 memNowUsed = state->allowedMem - state->availMem;
1240 
1241  /* Forget it if we've already maxed out memtuples, per comment above */
1242  if (!state->growmemtuples)
1243  return false;
1244 
1245  /* Select new value of memtupsize */
1246  if (memNowUsed <= state->availMem)
1247  {
1248  /*
1249  * We've used no more than half of allowedMem; double our usage,
1250  * clamping at INT_MAX tuples.
1251  */
1252  if (memtupsize < INT_MAX / 2)
1253  newmemtupsize = memtupsize * 2;
1254  else
1255  {
1256  newmemtupsize = INT_MAX;
1257  state->growmemtuples = false;
1258  }
1259  }
1260  else
1261  {
1262  /*
1263  * This will be the last increment of memtupsize. Abandon doubling
1264  * strategy and instead increase as much as we safely can.
1265  *
1266  * To stay within allowedMem, we can't increase memtupsize by more
1267  * than availMem / sizeof(SortTuple) elements. In practice, we want
1268  * to increase it by considerably less, because we need to leave some
1269  * space for the tuples to which the new array slots will refer. We
1270  * assume the new tuples will be about the same size as the tuples
1271  * we've already seen, and thus we can extrapolate from the space
1272  * consumption so far to estimate an appropriate new size for the
1273  * memtuples array. The optimal value might be higher or lower than
1274  * this estimate, but it's hard to know that in advance. We again
1275  * clamp at INT_MAX tuples.
1276  *
1277  * This calculation is safe against enlarging the array so much that
1278  * LACKMEM becomes true, because the memory currently used includes
1279  * the present array; thus, there would be enough allowedMem for the
1280  * new array elements even if no other memory were currently used.
1281  *
1282  * We do the arithmetic in float8, because otherwise the product of
1283  * memtupsize and allowedMem could overflow. Any inaccuracy in the
1284  * result should be insignificant; but even if we computed a
1285  * completely insane result, the checks below will prevent anything
1286  * really bad from happening.
1287  */
1288  double grow_ratio;
1289 
1290  grow_ratio = (double) state->allowedMem / (double) memNowUsed;
1291  if (memtupsize * grow_ratio < INT_MAX)
1292  newmemtupsize = (int) (memtupsize * grow_ratio);
1293  else
1294  newmemtupsize = INT_MAX;
1295 
1296  /* We won't make any further enlargement attempts */
1297  state->growmemtuples = false;
1298  }
1299 
1300  /* Must enlarge array by at least one element, else report failure */
1301  if (newmemtupsize <= memtupsize)
1302  goto noalloc;
1303 
1304  /*
1305  * On a 32-bit machine, allowedMem could exceed MaxAllocHugeSize. Clamp
1306  * to ensure our request won't be rejected. Note that we can easily
1307  * exhaust address space before facing this outcome. (This is presently
1308  * impossible due to guc.c's MAX_KILOBYTES limitation on work_mem, but
1309  * don't rely on that at this distance.)
1310  */
1311  if ((Size) newmemtupsize >= MaxAllocHugeSize / sizeof(SortTuple))
1312  {
1313  newmemtupsize = (int) (MaxAllocHugeSize / sizeof(SortTuple));
1314  state->growmemtuples = false; /* can't grow any more */
1315  }
1316 
1317  /*
1318  * We need to be sure that we do not cause LACKMEM to become true, else
1319  * the space management algorithm will go nuts. The code above should
1320  * never generate a dangerous request, but to be safe, check explicitly
1321  * that the array growth fits within availMem. (We could still cause
1322  * LACKMEM if the memory chunk overhead associated with the memtuples
1323  * array were to increase. That shouldn't happen because we chose the
1324  * initial array size large enough to ensure that palloc will be treating
1325  * both old and new arrays as separate chunks. But we'll check LACKMEM
1326  * explicitly below just in case.)
1327  */
1328  if (state->availMem < (int64) ((newmemtupsize - memtupsize) * sizeof(SortTuple)))
1329  goto noalloc;
1330 
1331  /* OK, do it */
1332  FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
1333  state->memtupsize = newmemtupsize;
1334  state->memtuples = (SortTuple *)
1335  repalloc_huge(state->memtuples,
1336  state->memtupsize * sizeof(SortTuple));
1337  USEMEM(state, GetMemoryChunkSpace(state->memtuples));
1338  if (LACKMEM(state))
1339  elog(ERROR, "unexpected out-of-memory situation in tuplesort");
1340  return true;
1341 
1342 noalloc:
1343  /* If for any reason we didn't realloc, shut off future attempts */
1344  state->growmemtuples = false;
1345  return false;
1346 }
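As a worked example of the two branches above (illustrative numbers only): with allowedMem = 64 MB, 16 MB used so far, and memtupsize = 1024, memNowUsed <= availMem holds, so the array simply doubles to 2048 slots. Later, with 48 MB used and only 16 MB still available, doubling would overshoot, so the final increment uses grow_ratio = 64/48, roughly 1.33, enlarging memtupsize to about 1365 before growmemtuples is cleared.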
1347 
1348 /*
1349  * Accept one tuple while collecting input data for sort.
1350  *
1351  * Note that the input data is always copied; the caller need not save it.
1352  */
1353 void
1354 tuplesort_puttupleslot(Tuplesortstate *state, TupleTableSlot *slot)
1355 {
1356  MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
1357  SortTuple stup;
1358 
1359  /*
1360  * Copy the given tuple into memory we control, and decrease availMem.
1361  * Then call the common code.
1362  */
1363  COPYTUP(state, &stup, (void *) slot);
1364 
1365  puttuple_common(state, &stup);
1366 
1367  MemoryContextSwitchTo(oldcontext);
1368 }
1369 
1370 /*
1371  * Accept one tuple while collecting input data for sort.
1372  *
1373  * Note that the input data is always copied; the caller need not save it.
1374  */
1375 void
1376 tuplesort_putheaptuple(Tuplesortstate *state, HeapTuple tup)
1377 {
1378  MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
1379  SortTuple stup;
1380 
1381  /*
1382  * Copy the given tuple into memory we control, and decrease availMem.
1383  * Then call the common code.
1384  */
1385  COPYTUP(state, &stup, (void *) tup);
1386 
1387  puttuple_common(state, &stup);
1388 
1389  MemoryContextSwitchTo(oldcontext);
1390 }
1391 
1392 /*
1393  * Collect one index tuple while collecting input data for sort, building
1394  * it from caller-supplied values.
1395  */
1396 void
1397 tuplesort_putindextuplevalues(Tuplesortstate *state, Relation rel,
1398  ItemPointer self, Datum *values,
1399  bool *isnull)
1400 {
1401  MemoryContext oldcontext = MemoryContextSwitchTo(state->tuplecontext);
1402  SortTuple stup;
1403  Datum original;
1404  IndexTuple tuple;
1405 
1406  stup.tuple = index_form_tuple(RelationGetDescr(rel), values, isnull);
1407  tuple = ((IndexTuple) stup.tuple);
1408  tuple->t_tid = *self;
1409  USEMEM(state, GetMemoryChunkSpace(stup.tuple));
1410  /* set up first-column key value */
1411  original = index_getattr(tuple,
1412  1,
1413  RelationGetDescr(state->indexRel),
1414  &stup.isnull1);
1415 
1416  MemoryContextSwitchTo(state->sortcontext);
1417 
1418  if (!state->sortKeys || !state->sortKeys->abbrev_converter || stup.isnull1)
1419  {
1420  /*
1421  * Store ordinary Datum representation, or NULL value. If there is a
1422  * converter it won't expect NULL values, and cost model is not
1423  * required to account for NULL, so in that case we avoid calling
1424  * converter and just set datum1 to zeroed representation (to be
1425  * consistent, and to support cheap inequality tests for NULL
1426  * abbreviated keys).
1427  */
1428  stup.datum1 = original;
1429  }
1430  else if (!consider_abort_common(state))
1431  {
1432  /* Store abbreviated key representation */
1433  stup.datum1 = state->sortKeys->abbrev_converter(original,
1434  state->sortKeys);
1435  }
1436  else
1437  {
1438  /* Abort abbreviation */
1439  int i;
1440 
1441  stup.datum1 = original;
1442 
1443  /*
1444  * Set state to be consistent with never trying abbreviation.
1445  *
1446  * Alter datum1 representation in already-copied tuples, so as to
1447  * ensure a consistent representation (current tuple was just
1448  * handled). It does not matter if some dumped tuples are already
1449  * sorted on tape, since serialized tuples lack abbreviated keys
1450  * (TSS_BUILDRUNS state prevents control reaching here in any case).
1451  */
1452  for (i = 0; i < state->memtupcount; i++)
1453  {
1454  SortTuple *mtup = &state->memtuples[i];
1455 
1456  tuple = mtup->tuple;
1457  mtup->datum1 = index_getattr(tuple,
1458  1,
1459  RelationGetDescr(state->indexRel),
1460  &mtup->isnull1);
1461  }
1462  }
1463 
1464  puttuple_common(state, &stup);
1465 
1466  MemoryContextSwitchTo(oldcontext);
1467 }
1468 
1469 /*
1470  * Accept one Datum while collecting input data for sort.
1471  *
1472  * If the Datum is pass-by-ref type, the value will be copied.
1473  */
1474 void
1475 tuplesort_putdatum(Tuplesortstate *state, Datum val, bool isNull)
1476 {
1477  MemoryContext oldcontext = MemoryContextSwitchTo(state->tuplecontext);
1478  SortTuple stup;
1479 
1480  /*
1481  * Pass-by-value types or null values are just stored directly in
1482  * stup.datum1 (and stup.tuple is not used and set to NULL).
1483  *
1484  * Non-null pass-by-reference values need to be copied into memory we
1485  * control, and possibly abbreviated. The copied value is pointed to by
1486  * stup.tuple and is treated as the canonical copy (e.g. to return via
1487  * tuplesort_getdatum or when writing to tape); stup.datum1 gets the
1488  * abbreviated value if abbreviation is happening, otherwise it's
1489  * identical to stup.tuple.
1490  */
1491 
1492  if (isNull || !state->tuples)
1493  {
1494  /*
1495  * Set datum1 to zeroed representation for NULLs (to be consistent,
1496  * and to support cheap inequality tests for NULL abbreviated keys).
1497  */
1498  stup.datum1 = !isNull ? val : (Datum) 0;
1499  stup.isnull1 = isNull;
1500  stup.tuple = NULL; /* no separate storage */
1501  MemoryContextSwitchTo(state->sortcontext);
1502  }
1503  else
1504  {
1505  Datum original = datumCopy(val, false, state->datumTypeLen);
1506 
1507  stup.isnull1 = false;
1508  stup.tuple = DatumGetPointer(original);
1509  USEMEM(state, GetMemoryChunkSpace(stup.tuple));
1510  MemoryContextSwitchTo(state->sortcontext);
1511 
1512  if (!state->sortKeys->abbrev_converter)
1513  {
1514  stup.datum1 = original;
1515  }
1516  else if (!consider_abort_common(state))
1517  {
1518  /* Store abbreviated key representation */
1519  stup.datum1 = state->sortKeys->abbrev_converter(original,
1520  state->sortKeys);
1521  }
1522  else
1523  {
1524  /* Abort abbreviation */
1525  int i;
1526 
1527  stup.datum1 = original;
1528 
1529  /*
1530  * Set state to be consistent with never trying abbreviation.
1531  *
1532  * Alter datum1 representation in already-copied tuples, so as to
1533  * ensure a consistent representation (current tuple was just
1534  * handled). It does not matter if some dumped tuples are already
1535  * sorted on tape, since serialized tuples lack abbreviated keys
1536  * (TSS_BUILDRUNS state prevents control reaching here in any
1537  * case).
1538  */
1539  for (i = 0; i < state->memtupcount; i++)
1540  {
1541  SortTuple *mtup = &state->memtuples[i];
1542 
1543  mtup->datum1 = PointerGetDatum(mtup->tuple);
1544  }
1545  }
1546  }
1547 
1548  puttuple_common(state, &stup);
1549 
1550  MemoryContextSwitchTo(oldcontext);
1551 }
1552 
1553 /*
1554  * Shared code for tuple and datum cases.
1555  */
1556 static void
1557 puttuple_common(Tuplesortstate *state, SortTuple *tuple)
1558 {
1559  switch (state->status)
1560  {
1561  case TSS_INITIAL:
1562 
1563  /*
1564  * Save the tuple into the unsorted array. First, grow the array
1565  * as needed. Note that we try to grow the array when there is
1566  * still one free slot remaining --- if we fail, there'll still be
1567  * room to store the incoming tuple, and then we'll switch to
1568  * tape-based operation.
1569  */
1570  if (state->memtupcount >= state->memtupsize - 1)
1571  {
1572  (void) grow_memtuples(state);
1573  Assert(state->memtupcount < state->memtupsize);
1574  }
1575  state->memtuples[state->memtupcount++] = *tuple;
1576 
1577  /*
1578  * Check if it's time to switch over to a bounded heapsort. We do
1579  * so if the input tuple count exceeds twice the desired tuple
1580  * count (this is a heuristic for where heapsort becomes cheaper
1581  * than a quicksort), or if we've just filled workMem and have
1582  * enough tuples to meet the bound.
1583  *
1584  * Note that once we enter TSS_BOUNDED state we will always try to
1585  * complete the sort that way. In the worst case, if later input
1586  * tuples are larger than earlier ones, this might cause us to
1587  * exceed workMem significantly.
1588  */
1589  if (state->bounded &&
1590  (state->memtupcount > state->bound * 2 ||
1591  (state->memtupcount > state->bound && LACKMEM(state))))
1592  {
1593 #ifdef TRACE_SORT
1594  if (trace_sort)
1595  elog(LOG, "switching to bounded heapsort at %d tuples: %s",
1596  state->memtupcount,
1597  pg_rusage_show(&state->ru_start));
1598 #endif
1599  make_bounded_heap(state);
1600  return;
1601  }
1602 
1603  /*
1604  * Done if we still fit in available memory and have array slots.
1605  */
1606  if (state->memtupcount < state->memtupsize && !LACKMEM(state))
1607  return;
1608 
1609  /*
1610  * Nope; time to switch to tape-based operation.
1611  */
1612  inittapes(state);
1613 
1614  /*
1615  * Dump tuples until we are back under the limit.
1616  */
1617  dumptuples(state, false);
1618  break;
1619 
1620  case TSS_BOUNDED:
1621 
1622  /*
1623  * We don't want to grow the array here, so check whether the new
1624  * tuple can be discarded before putting it in. This should be a
1625  * good speed optimization, too, since when there are many more
1626  * input tuples than the bound, most input tuples can be discarded
1627  * with just this one comparison. Note that because we currently
1628  * have the sort direction reversed, we must check for <= not >=.
1629  */
1630  if (COMPARETUP(state, tuple, &state->memtuples[0]) <= 0)
1631  {
1632  /* new tuple <= top of the heap, so we can discard it */
1633  free_sort_tuple(state, tuple);
1634  CHECK_FOR_INTERRUPTS();
1635  }
1636  else
1637  {
1638  /* discard top of heap, replacing it with the new tuple */
1639  free_sort_tuple(state, &state->memtuples[0]);
1640  tuple->tupindex = 0; /* not used */
1641  tuplesort_heap_replace_top(state, tuple, false);
1642  }
1643  break;
1644 
1645  case TSS_BUILDRUNS:
1646 
1647  /*
1648  * Insert the tuple into the heap, with run number currentRun if
1649  * it can go into the current run, else HEAP_RUN_NEXT. The tuple
1650  * can go into the current run if it is >= the first
1651  * not-yet-output tuple. (Actually, it could go into the current
1652  * run if it is >= the most recently output tuple ... but that
1653  * would require keeping around the tuple we last output, and it's
1654  * simplest to let writetup free each tuple as soon as it's
1655  * written.)
1656  *
1657  * Note that this only applies when:
1658  *
1659  * - currentRun is RUN_FIRST
1660  *
1661  * - Replacement selection is in use (typically it is never used).
1662  *
1663  * When these two conditions are not both true, all tuples are
1664  * appended indifferently, much like the TSS_INITIAL case.
1665  *
1666  * There should always be room to store the incoming tuple.
1667  */
1668  Assert(!state->replaceActive || state->memtupcount > 0);
1669  if (state->replaceActive &&
1670  COMPARETUP(state, tuple, &state->memtuples[0]) >= 0)
1671  {
1672  Assert(state->currentRun == RUN_FIRST);
1673 
1674  /*
1675  * Insert tuple into first, fully heapified run.
1676  *
1677  * Unlike classic replacement selection, which this module was
1678  * previously based on, only RUN_FIRST tuples are fully
1679  * heapified. Any second/next run tuples are appended
1680  * indifferently. While HEAP_RUN_NEXT tuples may be sifted
1681  * out of the way of first run tuples, COMPARETUP() will never
1682  * be called for the run's tuples during sifting (only our
1683  * initial COMPARETUP() call is required for the tuple, to
1684  * determine that the tuple does not belong in RUN_FIRST).
1685  */
1686  tuple->tupindex = state->currentRun;
1687  tuplesort_heap_insert(state, tuple, true);
1688  }
1689  else
1690  {
1691  /*
1692  * Tuple was determined to not belong to heapified RUN_FIRST,
1693  * or replacement selection not in play. Append the tuple to
1694  * memtuples indifferently.
1695  *
1696  * dumptuples() does not trust that the next run's tuples are
1697  * heapified. Anything past the first run will always be
1698  * quicksorted even when replacement selection is initially
1699  * used. (When it's never used, every tuple still takes this
1700  * path.)
1701  */
1702  tuple->tupindex = HEAP_RUN_NEXT;
1703  state->memtuples[state->memtupcount++] = *tuple;
1704  }
1705 
1706  /*
1707  * If we are over the memory limit, dump tuples till we're under.
1708  */
1709  dumptuples(state, false);
1710  break;
1711 
1712  default:
1713  elog(ERROR, "invalid tuplesort state");
1714  break;
1715  }
1716 }
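/*
 * A worked example of the TSS_INITIAL -> TSS_BOUNDED switch above, with
 * invented numbers: for a bounded sort with bound = 100 (say, ORDER BY
 * ... LIMIT 100), the switch to the bounded heap happens as soon as more
 * than 200 tuples have been accepted, or as soon as more than 100 tuples
 * have been accepted while LACKMEM() reports that workMem is exhausted.
 * From then on each additional input tuple costs at most one comparison
 * against the current 100th-best tuple, plus one heap replacement when it
 * sorts before that tuple.
 */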
1717 
1718 static bool
1719 consider_abort_common(Tuplesortstate *state)
1720 {
1721  Assert(state->sortKeys[0].abbrev_converter != NULL);
1722  Assert(state->sortKeys[0].abbrev_abort != NULL);
1723  Assert(state->sortKeys[0].abbrev_full_comparator != NULL);
1724 
1725  /*
1726  * Check effectiveness of abbreviation optimization. Consider aborting
1727  * when still within memory limit.
1728  */
1729  if (state->status == TSS_INITIAL &&
1730  state->memtupcount >= state->abbrevNext)
1731  {
1732  state->abbrevNext *= 2;
1733 
1734  /*
1735  * Check opclass-supplied abbreviation abort routine. It may indicate
1736  * that abbreviation should not proceed.
1737  */
1738  if (!state->sortKeys->abbrev_abort(state->memtupcount,
1739  state->sortKeys))
1740  return false;
1741 
1742  /*
1743  * Finally, restore authoritative comparator, and indicate that
1744  * abbreviation is not in play by setting abbrev_converter to NULL
1745  */
1746  state->sortKeys[0].comparator = state->sortKeys[0].abbrev_full_comparator;
1747  state->sortKeys[0].abbrev_converter = NULL;
1748  /* Not strictly necessary, but be tidy */
1749  state->sortKeys[0].abbrev_abort = NULL;
1750  state->sortKeys[0].abbrev_full_comparator = NULL;
1751 
1752  /* Give up - expect original pass-by-value representation */
1753  return true;
1754  }
1755 
1756  return false;
1757 }
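/*
 * Illustration of the abort-check cadence implemented above (numbers
 * assumed, not taken from this file): because abbrevNext doubles after
 * every check, the opclass abbrev_abort routine is consulted at
 * geometrically spaced points while the sort remains in TSS_INITIAL.
 * Assuming abbrevNext starts at 10, the checks fall at roughly 10, 20,
 * 40, 80, ... accepted tuples, so the cost of monitoring abbreviation
 * effectiveness stays logarithmic in the number of input tuples.
 */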
1758 
1759 /*
1760  * All tuples have been provided; finish the sort.
1761  */
1762 void
1763 tuplesort_performsort(Tuplesortstate *state)
1764 {
1765  MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
1766 
1767 #ifdef TRACE_SORT
1768  if (trace_sort)
1769  elog(LOG, "performsort starting: %s",
1770  pg_rusage_show(&state->ru_start));
1771 #endif
1772 
1773  switch (state->status)
1774  {
1775  case TSS_INITIAL:
1776 
1777  /*
1778  * We were able to accumulate all the tuples within the allowed
1779  * amount of memory. Just qsort 'em and we're done.
1780  */
1781  tuplesort_sort_memtuples(state);
1782  state->current = 0;
1783  state->eof_reached = false;
1784  state->markpos_offset = 0;
1785  state->markpos_eof = false;
1786  state->status = TSS_SORTEDINMEM;
1787  break;
1788 
1789  case TSS_BOUNDED:
1790 
1791  /*
1792  * We were able to accumulate all the tuples required for output
1793  * in memory, using a heap to eliminate excess tuples. Now we
1794  * have to transform the heap to a properly-sorted array.
1795  */
1796  sort_bounded_heap(state);
1797  state->current = 0;
1798  state->eof_reached = false;
1799  state->markpos_offset = 0;
1800  state->markpos_eof = false;
1801  state->status = TSS_SORTEDINMEM;
1802  break;
1803 
1804  case TSS_BUILDRUNS:
1805 
1806  /*
1807  * Finish tape-based sort. First, flush all tuples remaining in
1808  * memory out to tape; then merge until we have a single remaining
1809  * run (or, if !randomAccess, one run per tape). Note that
1810  * mergeruns sets the correct state->status.
1811  */
1812  dumptuples(state, true);
1813  mergeruns(state);
1814  state->eof_reached = false;
1815  state->markpos_block = 0L;
1816  state->markpos_offset = 0;
1817  state->markpos_eof = false;
1818  break;
1819 
1820  default:
1821  elog(ERROR, "invalid tuplesort state");
1822  break;
1823  }
1824 
1825 #ifdef TRACE_SORT
1826  if (trace_sort)
1827  {
1828  if (state->status == TSS_FINALMERGE)
1829  elog(LOG, "performsort done (except %d-way final merge): %s",
1830  state->activeTapes,
1831  pg_rusage_show(&state->ru_start));
1832  else
1833  elog(LOG, "performsort done: %s",
1834  pg_rusage_show(&state->ru_start));
1835  }
1836 #endif
1837 
1838  MemoryContextSwitchTo(oldcontext);
1839 }
1840 
1841 /*
1842  * Internal routine to fetch the next tuple in either forward or back
1843  * direction into *stup. Returns FALSE if no more tuples.
1844  * Returned tuple belongs to tuplesort memory context, and must not be freed
1845  * by caller. Caller should not use tuple following next call here.
1846  */
1847 static bool
1848 tuplesort_gettuple_common(Tuplesortstate *state, bool forward,
1849  SortTuple *stup)
1850 {
1851  unsigned int tuplen;
1852  size_t nmoved;
1853 
1854  switch (state->status)
1855  {
1856  case TSS_SORTEDINMEM:
1857  Assert(forward || state->randomAccess);
1858  Assert(!state->slabAllocatorUsed);
1859  if (forward)
1860  {
1861  if (state->current < state->memtupcount)
1862  {
1863  *stup = state->memtuples[state->current++];
1864  return true;
1865  }
1866  state->eof_reached = true;
1867 
1868  /*
1869  * Complain if caller tries to retrieve more tuples than
1870  * originally asked for in a bounded sort. This is because
1871  * returning EOF here might be the wrong thing.
1872  */
1873  if (state->bounded && state->current >= state->bound)
1874  elog(ERROR, "retrieved too many tuples in a bounded sort");
1875 
1876  return false;
1877  }
1878  else
1879  {
1880  if (state->current <= 0)
1881  return false;
1882 
1883  /*
1884  * If all tuples have already been fetched, return the last tuple;
1885  * otherwise return the tuple before the last one returned.
1886  */
1887  if (state->eof_reached)
1888  state->eof_reached = false;
1889  else
1890  {
1891  state->current--; /* last returned tuple */
1892  if (state->current <= 0)
1893  return false;
1894  }
1895  *stup = state->memtuples[state->current - 1];
1896  return true;
1897  }
1898  break;
1899 
1900  case TSS_SORTEDONTAPE:
1901  Assert(forward || state->randomAccess);
1902  Assert(state->slabAllocatorUsed);
1903 
1904  /*
1905  * The slot that held the tuple that we returned in previous
1906  * gettuple call can now be reused.
1907  */
1908  if (state->lastReturnedTuple)
1909  {
1910  RELEASE_SLAB_SLOT(state, state->lastReturnedTuple);
1911  state->lastReturnedTuple = NULL;
1912  }
1913 
1914  if (forward)
1915  {
1916  if (state->eof_reached)
1917  return false;
1918 
1919  if ((tuplen = getlen(state, state->result_tape, true)) != 0)
1920  {
1921  READTUP(state, stup, state->result_tape, tuplen);
1922 
1923  /*
1924  * Remember the tuple we return, so that we can recycle
1925  * its memory on next call. (This can be NULL, in the
1926  * !state->tuples case).
1927  */
1928  state->lastReturnedTuple = stup->tuple;
1929 
1930  return true;
1931  }
1932  else
1933  {
1934  state->eof_reached = true;
1935  return false;
1936  }
1937  }
1938 
1939  /*
1940  * Backward.
1941  *
1942  * If all tuples have already been fetched, return the last tuple;
1943  * otherwise return the tuple before the last one returned.
1944  */
1945  if (state->eof_reached)
1946  {
1947  /*
1948  * Seek position is pointing just past the zero tuplen at the
1949  * end of file; back up to fetch last tuple's ending length
1950  * word. If seek fails we must have a completely empty file.
1951  */
1952  nmoved = LogicalTapeBackspace(state->tapeset,
1953  state->result_tape,
1954  2 * sizeof(unsigned int));
1955  if (nmoved == 0)
1956  return false;
1957  else if (nmoved != 2 * sizeof(unsigned int))
1958  elog(ERROR, "unexpected tape position");
1959  state->eof_reached = false;
1960  }
1961  else
1962  {
1963  /*
1964  * Back up and fetch previously-returned tuple's ending length
1965  * word. If seek fails, assume we are at start of file.
1966  */
1967  nmoved = LogicalTapeBackspace(state->tapeset,
1968  state->result_tape,
1969  sizeof(unsigned int));
1970  if (nmoved == 0)
1971  return false;
1972  else if (nmoved != sizeof(unsigned int))
1973  elog(ERROR, "unexpected tape position");
1974  tuplen = getlen(state, state->result_tape, false);
1975 
1976  /*
1977  * Back up to get ending length word of tuple before it.
1978  */
1979  nmoved = LogicalTapeBackspace(state->tapeset,
1980  state->result_tape,
1981  tuplen + 2 * sizeof(unsigned int));
1982  if (nmoved == tuplen + sizeof(unsigned int))
1983  {
1984  /*
1985  * We backed up over the previous tuple, but there was no
1986  * ending length word before it. That means that the prev
1987  * tuple is the first tuple in the file. It is now the
1988  * next to read in forward direction (not obviously right,
1989  * but that is what in-memory case does).
1990  */
1991  return false;
1992  }
1993  else if (nmoved != tuplen + 2 * sizeof(unsigned int))
1994  elog(ERROR, "bogus tuple length in backward scan");
1995  }
1996 
1997  tuplen = getlen(state, state->result_tape, false);
1998 
1999  /*
2000  * Now we have the length of the prior tuple, back up and read it.
2001  * Note: READTUP expects we are positioned after the initial
2002  * length word of the tuple, so back up to that point.
2003  */
2004  nmoved = LogicalTapeBackspace(state->tapeset,
2005  state->result_tape,
2006  tuplen);
2007  if (nmoved != tuplen)
2008  elog(ERROR, "bogus tuple length in backward scan");
2009  READTUP(state, stup, state->result_tape, tuplen);
2010 
2011  /*
2012  * Remember the tuple we return, so that we can recycle its memory
2013  * on next call. (This can be NULL, in the Datum case).
2014  */
2015  state->lastReturnedTuple = stup->tuple;
2016 
2017  return true;
2018 
2019  case TSS_FINALMERGE:
2020  Assert(forward);
2021  /* We are managing memory ourselves, with the slab allocator. */
2022  Assert(state->slabAllocatorUsed);
2023 
2024  /*
2025  * The slab slot holding the tuple that we returned in previous
2026  * gettuple call can now be reused.
2027  */
2028  if (state->lastReturnedTuple)
2029  {
2030  RELEASE_SLAB_SLOT(state, state->lastReturnedTuple);
2031  state->lastReturnedTuple = NULL;
2032  }
2033 
2034  /*
2035  * This code should match the inner loop of mergeonerun().
2036  */
2037  if (state->memtupcount > 0)
2038  {
2039  int srcTape = state->memtuples[0].tupindex;
2040  SortTuple newtup;
2041 
2042  *stup = state->memtuples[0];
2043 
2044  /*
2045  * Remember the tuple we return, so that we can recycle its
2046  * memory on next call. (This can be NULL, in the Datum case).
2047  */
2048  state->lastReturnedTuple = stup->tuple;
2049 
2050  /*
2051  * Pull next tuple from tape, and replace the returned tuple
2052  * at top of the heap with it.
2053  */
2054  if (!mergereadnext(state, srcTape, &newtup))
2055  {
2056  /*
2057  * If no more data, we've reached end of run on this tape.
2058  * Remove the top node from the heap.
2059  */
2060  tuplesort_heap_delete_top(state, false);
2061 
2062  /*
2063  * Rewind to free the read buffer. It'd go away at the
2064  * end of the sort anyway, but better to release the
2065  * memory early.
2066  */
2067  LogicalTapeRewindForWrite(state->tapeset, srcTape);
2068  return true;
2069  }
2070  newtup.tupindex = srcTape;
2071  tuplesort_heap_replace_top(state, &newtup, false);
2072  return true;
2073  }
2074  return false;
2075 
2076  default:
2077  elog(ERROR, "invalid tuplesort state");
2078  return false; /* keep compiler quiet */
2079  }
2080 }
2081 
2082 /*
2083  * Fetch the next tuple in either forward or back direction.
2084  * If successful, put tuple in slot and return TRUE; else, clear the slot
2085  * and return FALSE.
2086  *
2087  * Caller may optionally be passed back abbreviated value (on TRUE return
2088  * value) when abbreviation was used, which can be used to cheaply avoid
2089  * equality checks that might otherwise be required. Caller can safely make a
2090  * determination of "non-equal tuple" based on simple binary inequality. A
2091  * NULL value in leading attribute will set abbreviated value to zeroed
2092  * representation, which caller may rely on in abbreviated inequality check.
2093  *
2094  * The slot receives a copied tuple (sometimes allocated in caller memory
2095  * context) that will stay valid regardless of future manipulations of the
2096  * tuplesort's state.
2097  */
2098 bool
2099 tuplesort_gettupleslot(Tuplesortstate *state, bool forward,
2100  TupleTableSlot *slot, Datum *abbrev)
2101 {
2102  MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
2103  SortTuple stup;
2104 
2105  if (!tuplesort_gettuple_common(state, forward, &stup))
2106  stup.tuple = NULL;
2107 
2108  MemoryContextSwitchTo(oldcontext);
2109 
2110  if (stup.tuple)
2111  {
2112  /* Record abbreviated key for caller */
2113  if (state->sortKeys->abbrev_converter && abbrev)
2114  *abbrev = stup.datum1;
2115 
2116  stup.tuple = heap_copy_minimal_tuple((MinimalTuple) stup.tuple);
2117  ExecStoreMinimalTuple((MinimalTuple) stup.tuple, slot, true);
2118  return true;
2119  }
2120  else
2121  {
2122  ExecClearTuple(slot);
2123  return false;
2124  }
2125 }
2126 
2127 /*
2128  * Fetch the next tuple in either forward or back direction.
2129  * Returns NULL if no more tuples. Returned tuple belongs to tuplesort memory
2130  * context, and must not be freed by caller. Caller should not use tuple
2131  * following next call here.
2132  */
2133 HeapTuple
2134 tuplesort_getheaptuple(Tuplesortstate *state, bool forward)
2135 {
2136  MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
2137  SortTuple stup;
2138 
2139  if (!tuplesort_gettuple_common(state, forward, &stup))
2140  stup.tuple = NULL;
2141 
2142  MemoryContextSwitchTo(oldcontext);
2143 
2144  return stup.tuple;
2145 }
2146 
2147 /*
2148  * Fetch the next index tuple in either forward or back direction.
2149  * Returns NULL if no more tuples. Returned tuple belongs to tuplesort memory
2150  * context, and must not be freed by caller. Caller should not use tuple
2151  * following next call here.
2152  */
2153 IndexTuple
2154 tuplesort_getindextuple(Tuplesortstate *state, bool forward)
2155 {
2156  MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
2157  SortTuple stup;
2158 
2159  if (!tuplesort_gettuple_common(state, forward, &stup))
2160  stup.tuple = NULL;
2161 
2162  MemoryContextSwitchTo(oldcontext);
2163 
2164  return (IndexTuple) stup.tuple;
2165 }
2166 
2167 /*
2168  * Fetch the next Datum in either forward or back direction.
2169  * Returns FALSE if no more datums.
2170  *
2171  * If the Datum is pass-by-ref type, the returned value is freshly palloc'd
2172  * and is now owned by the caller (this differs from similar routines for
2173  * other types of tuplesorts).
2174  *
2175  * Caller may optionally be passed back abbreviated value (on TRUE return
2176  * value) when abbreviation was used, which can be used to cheaply avoid
2177  * equality checks that might otherwise be required. Caller can safely make a
2178  * determination of "non-equal tuple" based on simple binary inequality. A
2179  * NULL value will have a zeroed abbreviated value representation, which caller
2180  * may rely on in abbreviated inequality check.
2181  */
2182 bool
2183 tuplesort_getdatum(Tuplesortstate *state, bool forward,
2184  Datum *val, bool *isNull, Datum *abbrev)
2185 {
2186  MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
2187  SortTuple stup;
2188 
2189  if (!tuplesort_gettuple_common(state, forward, &stup))
2190  {
2191  MemoryContextSwitchTo(oldcontext);
2192  return false;
2193  }
2194 
2195  /* Record abbreviated key for caller */
2196  if (state->sortKeys->abbrev_converter && abbrev)
2197  *abbrev = stup.datum1;
2198 
2199  if (stup.isnull1 || !state->tuples)
2200  {
2201  *val = stup.datum1;
2202  *isNull = stup.isnull1;
2203  }
2204  else
2205  {
2206  /* use stup.tuple because stup.datum1 may be an abbreviation */
2207  *val = datumCopy(PointerGetDatum(stup.tuple), false, state->datumTypeLen);
2208  *isNull = false;
2209  }
2210 
2211  MemoryContextSwitchTo(oldcontext);
2212 
2213  return true;
2214 }
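/*
 * A minimal caller-side sketch of the Datum-sort API used above, assuming
 * the tuplesort_begin_datum() constructor declared in tuplesort.h.  The
 * operator, collation and memory parameters below are illustrative only;
 * they are not taken from this file.
 */
#ifdef TUPLESORT_USAGE_SKETCH
static void
sort_int4_datums_sketch(Datum *values, int nvalues)
{
	Tuplesortstate *sortstate;
	int			i;

	/* int4 ascending sort, work_mem kilobytes, no random access */
	sortstate = tuplesort_begin_datum(INT4OID, Int4LessOperator,
									  InvalidOid, false, work_mem, false);

	for (i = 0; i < nvalues; i++)
		tuplesort_putdatum(sortstate, values[i], false);

	tuplesort_performsort(sortstate);

	for (;;)
	{
		Datum		val;
		bool		isnull;

		if (!tuplesort_getdatum(sortstate, true, &val, &isnull, NULL))
			break;
		/* consume val; int4 is pass-by-value, so there is nothing to pfree */
	}

	tuplesort_end(sortstate);
}
#endif							/* TUPLESORT_USAGE_SKETCH */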
2215 
2216 /*
2217  * Advance over N tuples in either forward or back direction,
2218  * without returning any data. N==0 is a no-op.
2219  * Returns TRUE if successful, FALSE if ran out of tuples.
2220  */
2221 bool
2222 tuplesort_skiptuples(Tuplesortstate *state, int64 ntuples, bool forward)
2223 {
2224  MemoryContext oldcontext;
2225 
2226  /*
2227  * We don't actually support backwards skip yet, because no callers need
2228  * it. The API is designed to allow for that later, though.
2229  */
2230  Assert(forward);
2231  Assert(ntuples >= 0);
2232 
2233  switch (state->status)
2234  {
2235  case TSS_SORTEDINMEM:
2236  if (state->memtupcount - state->current >= ntuples)
2237  {
2238  state->current += ntuples;
2239  return true;
2240  }
2241  state->current = state->memtupcount;
2242  state->eof_reached = true;
2243 
2244  /*
2245  * Complain if caller tries to retrieve more tuples than
2246  * originally asked for in a bounded sort. This is because
2247  * returning EOF here might be the wrong thing.
2248  */
2249  if (state->bounded && state->current >= state->bound)
2250  elog(ERROR, "retrieved too many tuples in a bounded sort");
2251 
2252  return false;
2253 
2254  case TSS_SORTEDONTAPE:
2255  case TSS_FINALMERGE:
2256 
2257  /*
2258  * We could probably optimize these cases better, but for now it's
2259  * not worth the trouble.
2260  */
2261  oldcontext = MemoryContextSwitchTo(state->sortcontext);
2262  while (ntuples-- > 0)
2263  {
2264  SortTuple stup;
2265 
2266  if (!tuplesort_gettuple_common(state, forward, &stup))
2267  {
2268  MemoryContextSwitchTo(oldcontext);
2269  return false;
2270  }
2271  CHECK_FOR_INTERRUPTS();
2272  }
2273  MemoryContextSwitchTo(oldcontext);
2274  return true;
2275 
2276  default:
2277  elog(ERROR, "invalid tuplesort state");
2278  return false; /* keep compiler quiet */
2279  }
2280 }
2281 
2282 /*
2283  * tuplesort_merge_order - report merge order we'll use for given memory
2284  * (note: "merge order" just means the number of input tapes in the merge).
2285  *
2286  * This is exported for use by the planner. allowedMem is in bytes.
2287  */
2288 int
2289 tuplesort_merge_order(int64 allowedMem)
2290 {
2291  int mOrder;
2292 
2293  /*
2294  * We need one tape for each merge input, plus another one for the output,
2295  * and each of these tapes needs buffer space. In addition we want
2296  * MERGE_BUFFER_SIZE workspace per input tape (but the output tape doesn't
2297  * count).
2298  *
2299  * Note: you might be thinking we need to account for the memtuples[]
2300  * array in this calculation, but we effectively treat that as part of the
2301  * MERGE_BUFFER_SIZE workspace.
2302  */
2303  mOrder = (allowedMem - TAPE_BUFFER_OVERHEAD) /
2304  (MERGE_BUFFER_SIZE + TAPE_BUFFER_OVERHEAD);
2305 
2306  /*
2307  * Even in minimum memory, use at least a MINORDER merge. On the other
2308  * hand, even when we have lots of memory, do not use more than a MAXORDER
2309  * merge. Tapes are pretty cheap, but they're not entirely free. Each
2310  * additional tape reduces the amount of memory available to build runs,
2311  * which in turn can cause the same sort to need more runs, which makes
2312  * merging slower even if it can still be done in a single pass. Also,
2313  * high order merges are quite slow due to CPU cache effects; it can be
2314  * faster to pay the I/O cost of a polyphase merge than to perform a single
2315  * merge pass across many hundreds of tapes.
2316  */
2317  mOrder = Max(mOrder, MINORDER);
2318  mOrder = Min(mOrder, MAXORDER);
2319 
2320  return mOrder;
2321 }
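/*
 * A worked instance of the computation above, assuming the values of
 * TAPE_BUFFER_OVERHEAD (24 kB) and MERGE_BUFFER_SIZE (256 kB) that the
 * macros earlier in this file yield with 8 kB blocks; the exact figures
 * are assumptions here.  With allowedMem = 4 MB:
 *
 *     mOrder = (4096 kB - 24 kB) / (256 kB + 24 kB) = 14
 *
 * (integer division), which then stays within [MINORDER, MAXORDER], so a
 * 4 MB sort merges up to 14 input tapes at a time.
 */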
2322 
2323 /*
2324  * useselection - determine algorithm to use to sort first run.
2325  *
2326  * It can sometimes be useful to use the replacement selection algorithm if it
2327  * results in one large run, and there is little available workMem. See
2328  * remarks on RUN_SECOND optimization within dumptuples().
2329  */
2330 static bool
2331 useselection(Tuplesortstate *state)
2332 {
2333  /*
2334  * memtupsize might be noticeably higher than memtupcount here in atypical
2335  * cases. It seems slightly preferable to not allow recent outliers to
2336  * impact this determination. Note that caller's trace_sort output
2337  * reports memtupcount instead.
2338  */
2339  if (state->memtupsize <= replacement_sort_tuples)
2340  return true;
2341 
2342  return false;
2343 }
2344 
2345 /*
2346  * inittapes - initialize for tape sorting.
2347  *
2348  * This is called only if we have found we don't have room to sort in memory.
2349  */
2350 static void
2351 inittapes(Tuplesortstate *state)
2352 {
2353  int maxTapes,
2354  j;
2355  int64 tapeSpace;
2356 
2357  /* Compute number of tapes to use: merge order plus 1 */
2358  maxTapes = tuplesort_merge_order(state->allowedMem) + 1;
2359 
2360  state->maxTapes = maxTapes;
2361  state->tapeRange = maxTapes - 1;
2362 
2363 #ifdef TRACE_SORT
2364  if (trace_sort)
2365  elog(LOG, "switching to external sort with %d tapes: %s",
2366  maxTapes, pg_rusage_show(&state->ru_start));
2367 #endif
2368 
2369  /*
2370  * Decrease availMem to reflect the space needed for tape buffers, when
2371  * writing the initial runs; but don't decrease it to the point that we
2372  * have no room for tuples. (That case is only likely to occur if sorting
2373  * pass-by-value Datums; in all other scenarios the memtuples[] array is
2374  * unlikely to occupy more than half of allowedMem. In the pass-by-value
2375  * case it's not important to account for tuple space, so we don't care if
2376  * LACKMEM becomes inaccurate.)
2377  */
2378  tapeSpace = (int64) maxTapes * TAPE_BUFFER_OVERHEAD;
2379 
2380  if (tapeSpace + GetMemoryChunkSpace(state->memtuples) < state->allowedMem)
2381  USEMEM(state, tapeSpace);
2382 
2383  /*
2384  * Make sure that the temp file(s) underlying the tape set are created in
2385  * suitable temp tablespaces.
2386  */
2387  PrepareTempTablespaces();
2388 
2389  /*
2390  * Create the tape set and allocate the per-tape data arrays.
2391  */
2392  state->tapeset = LogicalTapeSetCreate(maxTapes);
2393 
2394  state->mergeactive = (bool *) palloc0(maxTapes * sizeof(bool));
2395  state->tp_fib = (int *) palloc0(maxTapes * sizeof(int));
2396  state->tp_runs = (int *) palloc0(maxTapes * sizeof(int));
2397  state->tp_dummy = (int *) palloc0(maxTapes * sizeof(int));
2398  state->tp_tapenum = (int *) palloc0(maxTapes * sizeof(int));
2399 
2400  /*
2401  * Give replacement selection a try based on user setting. There will be
2402  * a switch to a simple hybrid sort-merge strategy after the first run
2403  * (iff we could not output one long run).
2404  */
2405  state->replaceActive = useselection(state);
2406 
2407  if (state->replaceActive)
2408  {
2409  /*
2410  * Convert the unsorted contents of memtuples[] into a heap. Each
2411  * tuple is marked as belonging to run number zero.
2412  *
2413  * NOTE: we pass false for checkIndex since there's no point in
2414  * comparing indexes in this step, even though we do intend the
2415  * indexes to be part of the sort key...
2416  */
2417  int ntuples = state->memtupcount;
2418 
2419 #ifdef TRACE_SORT
2420  if (trace_sort)
2421  elog(LOG, "replacement selection will sort %d first run tuples",
2422  state->memtupcount);
2423 #endif
2424  state->memtupcount = 0; /* make the heap empty */
2425 
2426  for (j = 0; j < ntuples; j++)
2427  {
2428  /* Must copy source tuple to avoid possible overwrite */
2429  SortTuple stup = state->memtuples[j];
2430 
2431  stup.tupindex = RUN_FIRST;
2432  tuplesort_heap_insert(state, &stup, false);
2433  }
2434  Assert(state->memtupcount == ntuples);
2435  }
2436 
2437  state->currentRun = RUN_FIRST;
2438 
2439  /*
2440  * Initialize variables of Algorithm D (step D1).
2441  */
2442  for (j = 0; j < maxTapes; j++)
2443  {
2444  state->tp_fib[j] = 1;
2445  state->tp_runs[j] = 0;
2446  state->tp_dummy[j] = 1;
2447  state->tp_tapenum[j] = j;
2448  }
2449  state->tp_fib[state->tapeRange] = 0;
2450  state->tp_dummy[state->tapeRange] = 0;
2451 
2452  state->Level = 1;
2453  state->destTape = 0;
2454 
2455  state->status = TSS_BUILDRUNS;
2456 }
2457 
2458 /*
2459  * selectnewtape -- select new tape for new initial run.
2460  *
2461  * This is called after finishing a run when we know another run
2462  * must be started. This implements steps D3, D4 of Algorithm D.
2463  */
2464 static void
2465 selectnewtape(Tuplesortstate *state)
2466 {
2467  int j;
2468  int a;
2469 
2470  /* Step D3: advance j (destTape) */
2471  if (state->tp_dummy[state->destTape] < state->tp_dummy[state->destTape + 1])
2472  {
2473  state->destTape++;
2474  return;
2475  }
2476  if (state->tp_dummy[state->destTape] != 0)
2477  {
2478  state->destTape = 0;
2479  return;
2480  }
2481 
2482  /* Step D4: increase level */
2483  state->Level++;
2484  a = state->tp_fib[0];
2485  for (j = 0; j < state->tapeRange; j++)
2486  {
2487  state->tp_dummy[j] = a + state->tp_fib[j + 1] - state->tp_fib[j];
2488  state->tp_fib[j] = a + state->tp_fib[j + 1];
2489  }
2490  state->destTape = 0;
2491 }
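/*
 * An illustrative trace of steps D3/D4 above (not part of the file): with
 * four tapes (tapeRange = 3), the target run distribution tp_fib[] grows
 * per level as
 *
 *     level 1:  1 1 1    (3 runs)
 *     level 2:  2 2 1    (5 runs)
 *     level 3:  4 3 2    (9 runs)
 *     level 4:  7 6 4    (17 runs)
 *
 * i.e. the generalized Fibonacci distribution of Knuth's Algorithm D;
 * whenever the actual number of initial runs falls short of the current
 * level's target, the difference is made up with dummy runs counted in
 * tp_dummy[].
 */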
2492 
2493 /*
2494  * Initialize the slab allocation arena, for the given number of slots.
2495  */
2496 static void
2497 init_slab_allocator(Tuplesortstate *state, int numSlots)
2498 {
2499  if (numSlots > 0)
2500  {
2501  char *p;
2502  int i;
2503 
2504  state->slabMemoryBegin = palloc(numSlots * SLAB_SLOT_SIZE);
2505  state->slabMemoryEnd = state->slabMemoryBegin +
2506  numSlots * SLAB_SLOT_SIZE;
2507  state->slabFreeHead = (SlabSlot *) state->slabMemoryBegin;
2508  USEMEM(state, numSlots * SLAB_SLOT_SIZE);
2509 
2510  p = state->slabMemoryBegin;
2511  for (i = 0; i < numSlots - 1; i++)
2512  {
2513  ((SlabSlot *) p)->nextfree = (SlabSlot *) (p + SLAB_SLOT_SIZE);
2514  p += SLAB_SLOT_SIZE;
2515  }
2516  ((SlabSlot *) p)->nextfree = NULL;
2517  }
2518  else
2519  {
2520  state->slabMemoryBegin = state->slabMemoryEnd = NULL;
2521  state->slabFreeHead = NULL;
2522  }
2523  state->slabAllocatorUsed = true;
2524 }
2525 
2526 /*
2527  * mergeruns -- merge all the completed initial runs.
2528  *
2529  * This implements steps D5, D6 of Algorithm D. All input data has
2530  * already been written to initial runs on tape (see dumptuples).
2531  */
2532 static void
2533 mergeruns(Tuplesortstate *state)
2534 {
2535  int tapenum,
2536  svTape,
2537  svRuns,
2538  svDummy;
2539  int numTapes;
2540  int numInputTapes;
2541 
2542  Assert(state->status == TSS_BUILDRUNS);
2543  Assert(state->memtupcount == 0);
2544 
2545  if (state->sortKeys != NULL && state->sortKeys->abbrev_converter != NULL)
2546  {
2547  /*
2548  * If there are multiple runs to be merged, when we go to read back
2549  * tuples from disk, abbreviated keys will not have been stored, and
2550  * we don't care to regenerate them. Disable abbreviation from this
2551  * point on.
2552  */
2553  state->sortKeys->abbrev_converter = NULL;
2554  state->sortKeys->comparator = state->sortKeys->abbrev_full_comparator;
2555 
2556  /* Not strictly necessary, but be tidy */
2557  state->sortKeys->abbrev_abort = NULL;
2558  state->sortKeys->abbrev_full_comparator = NULL;
2559  }
2560 
2561  /*
2562  * Reset tuple memory. We've freed all the tuples that we previously
2563  * allocated. We will use the slab allocator from now on.
2564  */
2565  MemoryContextDelete(state->tuplecontext);
2566  state->tuplecontext = NULL;
2567 
2568  /*
2569  * We no longer need a large memtuples array. (We will allocate a smaller
2570  * one for the heap later.)
2571  */
2572  FREEMEM(state, GetMemoryChunkSpace(state->memtuples));
2573  pfree(state->memtuples);
2574  state->memtuples = NULL;
2575 
2576  /*
2577  * If we had fewer runs than tapes, refund the memory that we imagined we
2578  * would need for the tape buffers of the unused tapes.
2579  *
2580  * numTapes and numInputTapes reflect the actual number of tapes we will
2581  * use. Note that the output tape's tape number is maxTapes - 1, so the
2582  * tape numbers of the used tapes are not consecutive, and you cannot just
2583  * loop from 0 to numTapes to visit all used tapes!
2584  */
2585  if (state->Level == 1)
2586  {
2587  numInputTapes = state->currentRun;
2588  numTapes = numInputTapes + 1;
2589  FREEMEM(state, (state->maxTapes - numTapes) * TAPE_BUFFER_OVERHEAD);
2590  }
2591  else
2592  {
2593  numInputTapes = state->tapeRange;
2594  numTapes = state->maxTapes;
2595  }
2596 
2597  /*
2598  * Initialize the slab allocator. We need one slab slot per input tape,
2599  * for the tuples in the heap, plus one to hold the tuple last returned
2600  * from tuplesort_gettuple. (If we're sorting pass-by-val Datums,
2601  * however, we don't need to allocate anything.)
2602  *
2603  * From this point on, we no longer use the USEMEM()/LACKMEM() mechanism
2604  * to track memory usage of individual tuples.
2605  */
2606  if (state->tuples)
2607  init_slab_allocator(state, numInputTapes + 1);
2608  else
2609  init_slab_allocator(state, 0);
2610 
2611  /*
2612  * If we produced only one initial run (quite likely if the total data
2613  * volume is between 1X and 2X workMem when replacement selection is used,
2614  * but something we particularly count on when input is presorted), we can
2615  * just use that tape as the finished output, rather than doing a useless
2616  * merge. (This obvious optimization is not in Knuth's algorithm.)
2617  */
2618  if (state->currentRun == RUN_SECOND)
2619  {
2620  state->result_tape = state->tp_tapenum[state->destTape];
2621  /* must freeze and rewind the finished output tape */
2622  LogicalTapeFreeze(state->tapeset, state->result_tape);
2623  state->status = TSS_SORTEDONTAPE;
2624  return;
2625  }
2626 
2627  /*
2628  * Allocate a new 'memtuples' array, for the heap. It will hold one tuple
2629  * from each input tape.
2630  */
2631  state->memtupsize = numInputTapes;
2632  state->memtuples = (SortTuple *) palloc(numInputTapes * sizeof(SortTuple));
2633  USEMEM(state, GetMemoryChunkSpace(state->memtuples));
2634 
2635  /*
2636  * Use all the remaining memory we have available for read buffers among
2637  * the input tapes.
2638  *
2639  * We do this only after checking for the case that we produced only one
2640  * initial run, because there is no need to use a large read buffer when
2641  * we're reading from a single tape. With one tape, the I/O pattern will
2642  * be the same regardless of the buffer size.
2643  *
2644  * We don't try to "rebalance" the memory among tapes, when we start a new
2645  * merge phase, even if some tapes are inactive in the new phase. That
2646  * would be hard, because logtape.c doesn't know where one run ends and
2647  * another begins. When a new merge phase begins, and a tape doesn't
2648  * participate in it, its buffer nevertheless already contains tuples from
2649  * the next run on same tape, so we cannot release the buffer. That's OK
2650  * in practice, merge performance isn't that sensitive to the amount of
2651  * buffers used, and most merge phases use all or almost all tapes,
2652  * anyway.
2653  */
2654 #ifdef TRACE_SORT
2655  if (trace_sort)
2656  elog(LOG, "using " INT64_FORMAT " KB of memory for read buffers among %d input tapes",
2657  (state->availMem) / 1024, numInputTapes);
2658 #endif
2659 
2660  state->read_buffer_size = Max(state->availMem / numInputTapes, 0);
2661  USEMEM(state, state->read_buffer_size * numInputTapes);
2662 
2663  /* End of step D2: rewind all output tapes to prepare for merging */
2664  for (tapenum = 0; tapenum < state->tapeRange; tapenum++)
2665  LogicalTapeRewindForRead(state->tapeset, tapenum, state->read_buffer_size);
2666 
2667  for (;;)
2668  {
2669  /*
2670  * At this point we know that tape[T] is empty. If there's just one
2671  * (real or dummy) run left on each input tape, then only one merge
2672  * pass remains. If we don't have to produce a materialized sorted
2673  * tape, we can stop at this point and do the final merge on-the-fly.
2674  */
2675  if (!state->randomAccess)
2676  {
2677  bool allOneRun = true;
2678 
2679  Assert(state->tp_runs[state->tapeRange] == 0);
2680  for (tapenum = 0; tapenum < state->tapeRange; tapenum++)
2681  {
2682  if (state->tp_runs[tapenum] + state->tp_dummy[tapenum] != 1)
2683  {
2684  allOneRun = false;
2685  break;
2686  }
2687  }
2688  if (allOneRun)
2689  {
2690  /* Tell logtape.c we won't be writing anymore */
2691  LogicalTapeSetForgetFreeSpace(state->tapeset);
2692  /* Initialize for the final merge pass */
2693  beginmerge(state);
2694  state->status = TSS_FINALMERGE;
2695  return;
2696  }
2697  }
2698 
2699  /* Step D5: merge runs onto tape[T] until tape[P] is empty */
2700  while (state->tp_runs[state->tapeRange - 1] ||
2701  state->tp_dummy[state->tapeRange - 1])
2702  {
2703  bool allDummy = true;
2704 
2705  for (tapenum = 0; tapenum < state->tapeRange; tapenum++)
2706  {
2707  if (state->tp_dummy[tapenum] == 0)
2708  {
2709  allDummy = false;
2710  break;
2711  }
2712  }
2713 
2714  if (allDummy)
2715  {
2716  state->tp_dummy[state->tapeRange]++;
2717  for (tapenum = 0; tapenum < state->tapeRange; tapenum++)
2718  state->tp_dummy[tapenum]--;
2719  }
2720  else
2721  mergeonerun(state);
2722  }
2723 
2724  /* Step D6: decrease level */
2725  if (--state->Level == 0)
2726  break;
2727  /* rewind output tape T to use as new input */
2728  LogicalTapeRewindForRead(state->tapeset, state->tp_tapenum[state->tapeRange],
2729  state->read_buffer_size);
2730  /* rewind used-up input tape P, and prepare it for write pass */
2731  LogicalTapeRewindForWrite(state->tapeset, state->tp_tapenum[state->tapeRange - 1]);
2732  state->tp_runs[state->tapeRange - 1] = 0;
2733 
2734  /*
2735  * reassign tape units per step D6; note we no longer care about A[]
2736  */
2737  svTape = state->tp_tapenum[state->tapeRange];
2738  svDummy = state->tp_dummy[state->tapeRange];
2739  svRuns = state->tp_runs[state->tapeRange];
2740  for (tapenum = state->tapeRange; tapenum > 0; tapenum--)
2741  {
2742  state->tp_tapenum[tapenum] = state->tp_tapenum[tapenum - 1];
2743  state->tp_dummy[tapenum] = state->tp_dummy[tapenum - 1];
2744  state->tp_runs[tapenum] = state->tp_runs[tapenum - 1];
2745  }
2746  state->tp_tapenum[0] = svTape;
2747  state->tp_dummy[0] = svDummy;
2748  state->tp_runs[0] = svRuns;
2749  }
2750 
2751  /*
2752  * Done. Knuth says that the result is on TAPE[1], but since we exited
2753  * the loop without performing the last iteration of step D6, we have not
2754  * rearranged the tape unit assignment, and therefore the result is on
2755  * TAPE[T]. We need to do it this way so that we can freeze the final
2756  * output tape while rewinding it. The last iteration of step D6 would be
2757  * a waste of cycles anyway...
2758  */
2759  state->result_tape = state->tp_tapenum[state->tapeRange];
2760  LogicalTapeFreeze(state->tapeset, state->result_tape);
2761  state->status = TSS_SORTEDONTAPE;
2762 
2763  /* Release the read buffers of all the other tapes, by rewinding them. */
2764  for (tapenum = 0; tapenum < state->maxTapes; tapenum++)
2765  {
2766  if (tapenum != state->result_tape)
2767  LogicalTapeRewindForWrite(state->tapeset, tapenum);
2768  }
2769 }
2770 
2771 /*
2772  * Merge one run from each input tape, except ones with dummy runs.
2773  *
2774  * This is the inner loop of Algorithm D step D5. We know that the
2775  * output tape is TAPE[T].
2776  */
2777 static void
2778 mergeonerun(Tuplesortstate *state)
2779 {
2780  int destTape = state->tp_tapenum[state->tapeRange];
2781  int srcTape;
2782 
2783  /*
2784  * Start the merge by loading one tuple from each active source tape into
2785  * the heap. We can also decrease the input run/dummy run counts.
2786  */
2787  beginmerge(state);
2788 
2789  /*
2790  * Execute merge by repeatedly extracting lowest tuple in heap, writing it
2791  * out, and replacing it with next tuple from same tape (if there is
2792  * another one).
2793  */
2794  while (state->memtupcount > 0)
2795  {
2796  SortTuple stup;
2797 
2798  /* write the tuple to destTape */
2799  srcTape = state->memtuples[0].tupindex;
2800  WRITETUP(state, destTape, &state->memtuples[0]);
2801 
2802  /* recycle the slot of the tuple we just wrote out, for the next read */
2803  if (state->memtuples[0].tuple)
2804  RELEASE_SLAB_SLOT(state, state->memtuples[0].tuple);
2805 
2806  /*
2807  * pull next tuple from the tape, and replace the written-out tuple in
2808  * the heap with it.
2809  */
2810  if (mergereadnext(state, srcTape, &stup))
2811  {
2812  stup.tupindex = srcTape;
2813  tuplesort_heap_replace_top(state, &stup, false);
2814 
2815  }
2816  else
2817  tuplesort_heap_delete_top(state, false);
2818  }
2819 
2820  /*
2821  * When the heap empties, we're done. Write an end-of-run marker on the
2822  * output tape, and increment its count of real runs.
2823  */
2824  markrunend(state, destTape);
2825  state->tp_runs[state->tapeRange]++;
2826 
2827 #ifdef TRACE_SORT
2828  if (trace_sort)
2829  elog(LOG, "finished %d-way merge step: %s", state->activeTapes,
2830  pg_rusage_show(&state->ru_start));
2831 #endif
2832 }
2833 
2834 /*
2835  * beginmerge - initialize for a merge pass
2836  *
2837  * We decrease the counts of real and dummy runs for each tape, and mark
2838  * which tapes contain active input runs in mergeactive[]. Then, fill the
2839  * merge heap with the first tuple from each active tape.
2840  */
2841 static void
2842 beginmerge(Tuplesortstate *state)
2843 {
2844  int activeTapes;
2845  int tapenum;
2846  int srcTape;
2847 
2848  /* Heap should be empty here */
2849  Assert(state->memtupcount == 0);
2850 
2851  /* Adjust run counts and mark the active tapes */
2852  memset(state->mergeactive, 0,
2853  state->maxTapes * sizeof(*state->mergeactive));
2854  activeTapes = 0;
2855  for (tapenum = 0; tapenum < state->tapeRange; tapenum++)
2856  {
2857  if (state->tp_dummy[tapenum] > 0)
2858  state->tp_dummy[tapenum]--;
2859  else
2860  {
2861  Assert(state->tp_runs[tapenum] > 0);
2862  state->tp_runs[tapenum]--;
2863  srcTape = state->tp_tapenum[tapenum];
2864  state->mergeactive[srcTape] = true;
2865  activeTapes++;
2866  }
2867  }
2868  Assert(activeTapes > 0);
2869  state->activeTapes = activeTapes;
2870 
2871  /* Load the merge heap with the first tuple from each input tape */
2872  for (srcTape = 0; srcTape < state->maxTapes; srcTape++)
2873  {
2874  SortTuple tup;
2875 
2876  if (mergereadnext(state, srcTape, &tup))
2877  {
2878  tup.tupindex = srcTape;
2879  tuplesort_heap_insert(state, &tup, false);
2880  }
2881  }
2882 }
2883 
2884 /*
2885  * mergereadnext - read next tuple from one merge input tape
2886  *
2887  * Returns false on EOF.
2888  */
2889 static bool
2890 mergereadnext(Tuplesortstate *state, int srcTape, SortTuple *stup)
2891 {
2892  unsigned int tuplen;
2893 
2894  if (!state->mergeactive[srcTape])
2895  return false; /* tape's run is already exhausted */
2896 
2897  /* read next tuple, if any */
2898  if ((tuplen = getlen(state, srcTape, true)) == 0)
2899  {
2900  state->mergeactive[srcTape] = false;
2901  return false;
2902  }
2903  READTUP(state, stup, srcTape, tuplen);
2904 
2905  return true;
2906 }
2907 
2908 /*
2909  * dumptuples - remove tuples from memtuples and write to tape
2910  *
2911  * This is used during initial-run building, but not during merging.
2912  *
2913  * When alltuples = false and replacement selection is still active, dump
2914  * only enough tuples to get under the availMem limit (and leave at least
2915  * one tuple in memtuples, since puttuple will then assume it is a heap that
2916  * has a tuple to compare to). We always insist there be at least one free
2917  * slot in the memtuples[] array.
2918  *
2919  * When alltuples = true, dump everything currently in memory. (This
2920  * case is only used at end of input data, although in practice only the
2921  * first run could fail to dump all tuples when we LACKMEM(), and only
2922  * when replacement selection is active.)
2923  *
2924  * If, when replacement selection is active, we see that the tuple run
2925  * number at the top of the heap has changed, start a new run. This must be
2926  * the first run, because replacement selection is always abandoned for all
2927  * further runs.
2928  */
2929 static void
2930 dumptuples(Tuplesortstate *state, bool alltuples)
2931 {
2932  while (alltuples ||
2933  (LACKMEM(state) && state->memtupcount > 1) ||
2934  state->memtupcount >= state->memtupsize)
2935  {
2936  if (state->replaceActive)
2937  {
2938  /*
2939  * Still holding out for a case favorable to replacement
2940  * selection. Still incrementally spilling using heap.
2941  *
2942  * Dump the heap's frontmost entry, and remove it from the heap.
2943  */
2944  Assert(state->memtupcount > 0);
2945  WRITETUP(state, state->tp_tapenum[state->destTape],
2946  &state->memtuples[0]);
2947  tuplesort_heap_delete_top(state, true);
2948  }
2949  else
2950  {
2951  /*
2952  * Once committed to quicksorting runs, never incrementally spill
2953  */
2954  dumpbatch(state, alltuples);
2955  break;
2956  }
2957 
2958  /*
2959  * If top run number has changed, we've finished the current run (this
2960  * can only be the first run), and will no longer spill incrementally.
2961  */
2962  if (state->memtupcount == 0 ||
2963  state->memtuples[0].tupindex == HEAP_RUN_NEXT)
2964  {
2965  markrunend(state, state->tp_tapenum[state->destTape]);
2966  Assert(state->currentRun == RUN_FIRST);
2967  state->currentRun++;
2968  state->tp_runs[state->destTape]++;
2969  state->tp_dummy[state->destTape]--; /* per Alg D step D2 */
2970 
2971 #ifdef TRACE_SORT
2972  if (trace_sort)
2973  elog(LOG, "finished incrementally writing %s run %d to tape %d: %s",
2974  (state->memtupcount == 0) ? "only" : "first",
2975  state->currentRun, state->destTape,
2976  pg_rusage_show(&state->ru_start));
2977 #endif
2978 
2979  /*
2980  * Done if heap is empty, which is possible when there is only one
2981  * long run.
2982  */
2983  Assert(state->currentRun == RUN_SECOND);
2984  if (state->memtupcount == 0)
2985  {
2986  /*
2987  * Replacement selection best case; no final merge required,
2988  * because there was only one initial run (second run has no
2989  * tuples). See RUN_SECOND case in mergeruns().
2990  */
2991  break;
2992  }
2993 
2994  /*
2995  * Abandon replacement selection for second run (as well as any
2996  * subsequent runs).
2997  */
2998  state->replaceActive = false;
2999 
3000  /*
3001  * First tuple of next run should not be heapified, and so will
3002  * bear placeholder run number. In practice this must actually be
3003  * the second run, which just became the currentRun, so we're
3004  * clear to quicksort and dump the tuples in batch next time
3005  * memtuples becomes full.
3006  */
3007  Assert(state->memtuples[0].tupindex == HEAP_RUN_NEXT);
3008  selectnewtape(state);
3009  }
3010  }
3011 }
3012 
3013 /*
3014  * dumpbatch - sort and dump all memtuples, forming one run on tape
3015  *
3016  * Second or subsequent runs are never heapified by this module (although
3017  * heapification still respects run number differences between the first and
3018  * second runs), and a heap (replacement selection priority queue) is often
3019  * avoided in the first place.
3020  */
3021 static void
3022 dumpbatch(Tuplesortstate *state, bool alltuples)
3023 {
3024  int memtupwrite;
3025  int i;
3026 
3027  /*
3028  * Final call might require no sorting, in rare cases where we just so
3029  * happen to have previously LACKMEM()'d at the point where exactly all
3030  * remaining tuples are loaded into memory, just before input was
3031  * exhausted.
3032  *
3033  * In general, short final runs are quite possible. Rather than allowing
3034  * a special case where there was a superfluous selectnewtape() call (i.e.
3035  * a call with no subsequent run actually written to destTape), we prefer
3036  * to write out a 0 tuple run.
3037  *
3038  * mergereadnext() is prepared for 0 tuple runs, and will reliably mark
3039  * the tape inactive for the merge when called from beginmerge(). This
3040  * case is therefore similar to the case where mergeonerun() finds a dummy
3041  * run for the tape, and so doesn't need to merge a run from the tape (or
3042  * conceptually "merges" the dummy run, if you prefer). According to
3043  * Knuth, Algorithm D "isn't strictly optimal" in its method of
3044  * distribution and dummy run assignment; this edge case seems very
3045  * unlikely to make that appreciably worse.
3046  */
3047  Assert(state->status == TSS_BUILDRUNS);
3048 
3049  /*
3050  * It seems unlikely that this limit will ever be exceeded, but take no
3051  * chances
3052  */
3053  if (state->currentRun == INT_MAX)
3054  ereport(ERROR,
3055  (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
3056  errmsg("cannot have more than %d runs for an external sort",
3057  INT_MAX)));
3058 
3059  state->currentRun++;
3060 
3061 #ifdef TRACE_SORT
3062  if (trace_sort)
3063  elog(LOG, "starting quicksort of run %d: %s",
3064  state->currentRun, pg_rusage_show(&state->ru_start));
3065 #endif
3066 
3067  /*
3068  * Sort all tuples accumulated within the allowed amount of memory for
3069  * this run using quicksort
3070  */
3071  tuplesort_sort_memtuples(state);
3072 
3073 #ifdef TRACE_SORT
3074  if (trace_sort)
3075  elog(LOG, "finished quicksort of run %d: %s",
3076  state->currentRun, pg_rusage_show(&state->ru_start));
3077 #endif
3078 
3079  memtupwrite = state->memtupcount;
3080  for (i = 0; i < memtupwrite; i++)
3081  {
3082  WRITETUP(state, state->tp_tapenum[state->destTape],
3083  &state->memtuples[i]);
3084  state->memtupcount--;
3085  }
3086 
3087  /*
3088  * Reset tuple memory. We've freed all of the tuples that we previously
3089  * allocated. It's important to avoid fragmentation when there is a stark
3090  * change in the sizes of incoming tuples. Fragmentation due to
3091  * AllocSetFree's bucketing by size class might be particularly bad if
3092  * this step wasn't taken.
3093  */
3094  MemoryContextReset(state->tuplecontext);
3095 
3096  markrunend(state, state->tp_tapenum[state->destTape]);
3097  state->tp_runs[state->destTape]++;
3098  state->tp_dummy[state->destTape]--; /* per Alg D step D2 */
3099 
3100 #ifdef TRACE_SORT
3101  if (trace_sort)
3102  elog(LOG, "finished writing run %d to tape %d: %s",
3103  state->currentRun, state->destTape,
3104  pg_rusage_show(&state->ru_start));
3105 #endif
3106 
3107  if (!alltuples)
3108  selectnewtape(state);
3109 }
3110 
3111 /*
3112  * tuplesort_rescan - rewind and replay the scan
3113  */
3114 void
3115 tuplesort_rescan(Tuplesortstate *state)
3116 {
3117  MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
3118 
3119  Assert(state->randomAccess);
3120 
3121  switch (state->status)
3122  {
3123  case TSS_SORTEDINMEM:
3124  state->current = 0;
3125  state->eof_reached = false;
3126  state->markpos_offset = 0;
3127  state->markpos_eof = false;
3128  break;
3129  case TSS_SORTEDONTAPE:
3130  LogicalTapeRewindForRead(state->tapeset,
3131  state->result_tape,
3132  0);
3133  state->eof_reached = false;
3134  state->markpos_block = 0L;
3135  state->markpos_offset = 0;
3136  state->markpos_eof = false;
3137  break;
3138  default:
3139  elog(ERROR, "invalid tuplesort state");
3140  break;
3141  }
3142 
3143  MemoryContextSwitchTo(oldcontext);
3144 }
3145 
3146 /*
3147  * tuplesort_markpos - saves current position in the merged sort file
3148  */
3149 void
3150 tuplesort_markpos(Tuplesortstate *state)
3151 {
3152  MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
3153 
3154  Assert(state->randomAccess);
3155 
3156  switch (state->status)
3157  {
3158  case TSS_SORTEDINMEM:
3159  state->markpos_offset = state->current;
3160  state->markpos_eof = state->eof_reached;
3161  break;
3162  case TSS_SORTEDONTAPE:
3163  LogicalTapeTell(state->tapeset,
3164  state->result_tape,
3165  &state->markpos_block,
3166  &state->markpos_offset);
3167  state->markpos_eof = state->eof_reached;
3168  break;
3169  default:
3170  elog(ERROR, "invalid tuplesort state");
3171  break;
3172  }
3173 
3174  MemoryContextSwitchTo(oldcontext);
3175 }
3176 
3177 /*
3178  * tuplesort_restorepos - restores current position in merged sort file to
3179  * last saved position
3180  */
3181 void
3182 tuplesort_restorepos(Tuplesortstate *state)
3183 {
3184  MemoryContext oldcontext = MemoryContextSwitchTo(state->sortcontext);
3185 
3186  Assert(state->randomAccess);
3187 
3188  switch (state->status)
3189  {
3190  case TSS_SORTEDINMEM:
3191  state->current = state->markpos_offset;
3192  state->eof_reached = state->markpos_eof;
3193  break;
3194  case TSS_SORTEDONTAPE:
3195  LogicalTapeSeek(state->tapeset,
3196  state->result_tape,
3197  state->markpos_block,
3198  state->markpos_offset);
3199  state->eof_reached = state->markpos_eof;
3200  break;
3201  default:
3202  elog(ERROR, "invalid tuplesort state");
3203  break;
3204  }
3205 
3206  MemoryContextSwitchTo(oldcontext);
3207 }
3208 
3209 /*
3210  * tuplesort_get_stats - extract summary statistics
3211  *
3212  * This can be called after tuplesort_performsort() finishes to obtain
3213  * printable summary information about how the sort was performed.
3214  * spaceUsed is measured in kilobytes.
3215  */
3216 void
3217 tuplesort_get_stats(Tuplesortstate *state,
3218  const char **sortMethod,
3219  const char **spaceType,
3220  long *spaceUsed)
3221 {
3222  /*
3223  * Note: it might seem we should provide both memory and disk usage for a
3224  * disk-based sort. However, the current code doesn't track memory space
3225  * accurately once we have begun to return tuples to the caller (since we
3226  * don't account for pfree's the caller is expected to do), so we cannot
3227  * rely on availMem in a disk sort. This does not seem worth the overhead
3228  * to fix. Is it worth creating an API for the memory context code to
3229  * tell us how much is actually used in sortcontext?
3230  */
3231  if (state->tapeset)
3232  {
3233  *spaceType = "Disk";
3234  *spaceUsed = LogicalTapeSetBlocks(state->tapeset) * (BLCKSZ / 1024);
3235  }
3236  else
3237  {
3238  *spaceType = "Memory";
3239  *spaceUsed = (state->allowedMem - state->availMem + 1023) / 1024;
3240  }
3241 
3242  switch (state->status)
3243  {
3244  case TSS_SORTEDINMEM:
3245  if (state->boundUsed)
3246  *sortMethod = "top-N heapsort";
3247  else
3248  *sortMethod = "quicksort";
3249  break;
3250  case TSS_SORTEDONTAPE:
3251  *sortMethod = "external sort";
3252  break;
3253  case TSS_FINALMERGE:
3254  *sortMethod = "external merge";
3255  break;
3256  default:
3257  *sortMethod = "still in progress";
3258  break;
3259  }
3260 }
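/*
 * For illustration (wording assumed, not verified against explain.c),
 * these outputs surface in EXPLAIN ANALYZE lines such as
 *
 *     Sort Method: quicksort  Memory: 25kB
 *     Sort Method: top-N heapsort  Memory: 32kB
 *     Sort Method: external merge  Disk: 10208kB
 *
 * where the method string is *sortMethod, the "Memory"/"Disk" label is
 * *spaceType, and the kilobyte figure is *spaceUsed.
 */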
3261 
3262 
3263 /*
3264  * Heap manipulation routines, per Knuth's Algorithm 5.2.3H.
3265  *
3266  * Compare two SortTuples. If checkIndex is true, use the tuple index
3267  * as the front of the sort key; otherwise, no.
3268  *
3269  * Note that for checkIndex callers, the heap invariant is never
3270  * maintained beyond the first run, and so there are no COMPARETUP()
3271  * calls needed to distinguish tuples in HEAP_RUN_NEXT.
3272  */
3273 
3274 #define HEAPCOMPARE(tup1,tup2) \
3275  (checkIndex && ((tup1)->tupindex != (tup2)->tupindex || \
3276  (tup1)->tupindex == HEAP_RUN_NEXT) ? \
3277  ((tup1)->tupindex) - ((tup2)->tupindex) : \
3278  COMPARETUP(state, tup1, tup2))
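/*
 * A small invented example of the ordering HEAPCOMPARE() imposes when
 * checkIndex is true: a tuple with tupindex = RUN_FIRST and key 1000
 * still sorts before a tuple with tupindex = HEAP_RUN_NEXT and key 1,
 * because differing run indexes are compared directly and COMPARETUP()
 * is never reached; only when both tuples belong to the first run does
 * the real key comparison run.
 */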
3279 
3280 /*
3281  * Convert the existing unordered array of SortTuples to a bounded heap,
3282  * discarding all but the smallest "state->bound" tuples.
3283  *
3284  * When working with a bounded heap, we want to keep the largest entry
3285  * at the root (array entry zero), instead of the smallest as in the normal
3286  * sort case. This allows us to discard the largest entry cheaply.
3287  * Therefore, we temporarily reverse the sort direction.
3288  *
3289  * We assume that all entries in a bounded heap will always have tupindex
3290  * zero; it therefore doesn't matter that HEAPCOMPARE() doesn't reverse
3291  * the direction of comparison for tupindexes.
3292  */
3293 static void
3294 make_bounded_heap(Tuplesortstate *state)
3295 {
3296  int tupcount = state->memtupcount;
3297  int i;
3298 
3299  Assert(state->status == TSS_INITIAL);
3300  Assert(state->bounded);
3301  Assert(tupcount >= state->bound);
3302 
3303  /* Reverse sort direction so largest entry will be at root */
3304  reversedirection(state);
3305 
3306  state->memtupcount = 0; /* make the heap empty */
3307  for (i = 0; i < tupcount; i++)
3308  {
3309  if (state->memtupcount < state->bound)
3310  {
3311  /* Insert next tuple into heap */
3312  /* Must copy source tuple to avoid possible overwrite */
3313  SortTuple stup = state->memtuples[i];
3314 
3315  stup.tupindex = 0; /* not used */
3316  tuplesort_heap_insert(state, &stup, false);
3317  }
3318  else
3319  {
3320  /*
3321  * The heap is full. Replace the largest entry with the new
3322  * tuple, or just discard it, if it's larger than anything already
3323  * in the heap.
3324  */
3325  if (COMPARETUP(state, &state->memtuples[i], &state->memtuples[0]) <= 0)
3326  {
3327  free_sort_tuple(state, &state->memtuples[i]);
3328  CHECK_FOR_INTERRUPTS();
3329  }
3330  else
3331  tuplesort_heap_replace_top(state, &state->memtuples[i], false);
3332  }
3333  }
3334 
3335  Assert(state->memtupcount == state->bound);
3336  state->status = TSS_BOUNDED;
3337 }
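/*
 * Worked example (editorial, not part of tuplesort.c): with bound = 3 and
 * input 5 9 1 7 2, the first three values are heapified with the sort
 * direction reversed, so 9 (the current maximum) sits at the root.  7 is
 * smaller than 9, so 9 is replaced and sifted out (the heap keeps
 * {5, 1, 7}); 2 is smaller than the new maximum 7, which is replaced in
 * turn (the heap keeps {5, 1, 2}).  Only the three smallest inputs
 * survive, which is exactly what the loop above arranges.
 */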
3338 
3339 /*
3340  * Convert the bounded heap to a properly-sorted array
3341  */
3342 static void
3343 sort_bounded_heap(Tuplesortstate *state)
3344 {
3345  int tupcount = state->memtupcount;
3346 
3347  Assert(state->status == TSS_BOUNDED);
3348  Assert(state->bounded);
3349  Assert(tupcount == state->bound);
3350 
3351  /*
3352  * We can unheapify in place because each delete-top call will remove the
3353  * largest entry, which we can promptly store in the newly freed slot at
3354  * the end. Once we're down to a single-entry heap, we're done.
3355  */
3356  while (state->memtupcount > 1)
3357  {
3358  SortTuple stup = state->memtuples[0];
3359 
3360  /* this sifts-up the next-largest entry and decreases memtupcount */
3361  tuplesort_heap_delete_top(state, false);
3362  state->memtuples[state->memtupcount] = stup;
3363  }
3364  state->memtupcount = tupcount;
3365 
3366  /*
3367  * Reverse sort direction back to the original state. This is not
3368  * actually necessary but seems like a good idea for tidiness.
3369  */
3370  reversedirection(state);
3371 
3372  state->status = TSS_SORTEDINMEM;
3373  state->boundUsed = true;
3374 }
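/*
 * Illustrative sketch (editorial, not part of tuplesort.c; all demo_*
 * names are assumptions): the same bounded-heap technique on a plain int
 * array.  demo_keep_smallest() retains the "bound" smallest inputs in a
 * max-heap, then unheapifies in place so the survivors end up in ascending
 * order, mirroring make_bounded_heap() and sort_bounded_heap() above.
 * Assumes n >= bound >= 1.
 */
#include <stddef.h>

static void
demo_sift_down(int *heap, size_t n, size_t i)
{
    int     val = heap[i];

    for (;;)
    {
        size_t  j = 2 * i + 1;

        if (j >= n)
            break;
        if (j + 1 < n && heap[j + 1] > heap[j])
            j++;                /* pick the larger child: this is a max-heap */
        if (val >= heap[j])
            break;
        heap[i] = heap[j];
        i = j;
    }
    heap[i] = val;
}

/* Reduce a[0..n) to its "bound" smallest values, left sorted ascending. */
static void
demo_keep_smallest(int *a, size_t n, size_t bound)
{
    size_t  heapcount = 0;
    size_t  i;

    for (i = 0; i < n; i++)
    {
        int     val = a[i];     /* copy first: the heap may overwrite a[i] */

        if (heapcount < bound)
        {
            /* heap not yet full: sift-up insertion */
            size_t  j = heapcount++;

            while (j > 0 && a[(j - 1) / 2] < val)
            {
                a[j] = a[(j - 1) / 2];
                j = (j - 1) / 2;
            }
            a[j] = val;
        }
        else if (val < a[0])
        {
            /* smaller than the largest value kept: replace the root */
            a[0] = val;
            demo_sift_down(a, heapcount, 0);
        }
        /* else: at least as large as everything kept, so simply drop it */
    }

    /* unheapify in place: move each successive maximum to the freed tail slot */
    while (heapcount > 1)
    {
        int     top = a[0];

        a[0] = a[--heapcount];
        demo_sift_down(a, heapcount, 0);
        a[heapcount] = top;
    }
}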
3375 
3376 /*
3377  * Sort all memtuples using specialized qsort() routines.
3378  *
3379  * Quicksort is used for small in-memory sorts. Quicksort is also generally
3380  * preferred to replacement selection for generating runs during external sort
3381  * operations, although replacement selection is sometimes used for the first
3382  * run.
3383  */
3384 static void
3385 tuplesort_sort_memtuples(Tuplesortstate *state)
3386 {
3387  if (state->memtupcount > 1)
3388  {
3389  /* Can we use the single-key sort function? */
3390  if (state->onlyKey != NULL)
3391  qsort_ssup(state->memtuples, state->memtupcount,
3392  state->onlyKey);
3393  else
3394  qsort_tuple(state->memtuples,
3395  state->memtupcount,
3396  state->comparetup,
3397  state);
3398  }
3399 }
3400 
3401 /*
3402  * Insert a new tuple into an empty or existing heap, maintaining the
3403  * heap invariant. Caller is responsible for ensuring there's room.
3404  *
3405  * Note: For some callers, tuple points to a memtuples[] entry above the
3406  * end of the heap. This is safe as long as it's not immediately adjacent
3407  * to the end of the heap (ie, in the [memtupcount] array entry) --- if it
3408  * is, it might get overwritten before being moved into the heap!
3409  */
3410 static void
3411 tuplesort_heap_insert(Tuplesortstate *state, SortTuple *tuple,
3412  bool checkIndex)
3413 {
3414  SortTuple *memtuples;
3415  int j;
3416 
3417  memtuples = state->memtuples;
3418  Assert(state->memtupcount < state->memtupsize);
3419  Assert(!checkIndex || tuple->tupindex == RUN_FIRST);
3420 
3421  CHECK_FOR_INTERRUPTS();
3422 
3423  /*
3424  * Sift-up the new entry, per Knuth 5.2.3 exercise 16. Note that Knuth is
3425  * using 1-based array indexes, not 0-based.
3426  */
3427  j = state->memtupcount++;
3428  while (j > 0)
3429  {
3430  int i = (j - 1) >> 1;
3431 
3432  if (HEAPCOMPARE(tuple, &memtuples[i]) >= 0)
3433  break;
3434  memtuples[j] = memtuples[i];
3435  j = i;
3436  }
3437  memtuples[j] = *tuple;
3438 }
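/*
 * Worked example (editorial, not part of tuplesort.c): with 0-based
 * indexes the parent of slot j is (j - 1) >> 1, matching Knuth's
 * floor(j/2) for 1-based arrays.  Inserting 0 into the min-heap
 * [1, 3, 2]: start at j = 3; parent slot 1 holds 3 > 0, so 3 moves down
 * to slot 3 and j becomes 1; parent slot 0 holds 1 > 0, so 1 moves down
 * to slot 1 and j becomes 0; the loop ends and 0 is stored at the root,
 * giving [0, 1, 2, 3].
 */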
3439 
3440 /*
3441  * Remove the tuple at state->memtuples[0] from the heap. Decrement
3442  * memtupcount, and sift up to maintain the heap invariant.
3443  *
3444  * The caller has already free'd the tuple the top node points to,
3445  * if necessary.
3446  */
3447 static void
3448 tuplesort_heap_delete_top(Tuplesortstate *state, bool checkIndex)
3449 {
3450  SortTuple *memtuples = state->memtuples;
3451  SortTuple *tuple;
3452 
3453  Assert(!checkIndex || state->currentRun == RUN_FIRST);
3454  if (--state->memtupcount <= 0)
3455  return;
3456 
3457  /*
3458  * Remove the last tuple in the heap, and re-insert it, by replacing the
3459  * current top node with it.
3460  */
3461  tuple = &memtuples[state->memtupcount];
3462  tuplesort_heap_replace_top(state, tuple, checkIndex);
3463 }
3464 
3465 /*
3466  * Replace the tuple at state->memtuples[0] with a new tuple. Sift up to
3467  * maintain the heap invariant.
3468  *
3469  * This corresponds to Knuth's "sift-up" algorithm (Algorithm 5.2.3H,
3470  * Heapsort, steps H3-H8).
3471  */
3472 static void
3473 tuplesort_heap_replace_top(Tuplesortstate *state, SortTuple *tuple,
3474  bool checkIndex)
3475 {
3476  SortTuple *memtuples = state->memtuples;
3477  int i,
3478  n;
3479 
3480  Assert(!checkIndex || state->currentRun == RUN_FIRST);
3481  Assert(state->memtupcount >= 1);
3482 
3483  CHECK_FOR_INTERRUPTS();
3484 
3485  n = state->memtupcount;
3486  i = 0; /* i is where the "hole" is */
3487  for (;;)
3488  {
3489  int j = 2 * i + 1;
3490 
3491  if (j >= n)
3492  break;
3493  if (j + 1 < n &&
3494  HEAPCOMPARE(&memtuples[j], &memtuples[j + 1]) > 0)
3495  j++;
3496  if (HEAPCOMPARE(tuple, &memtuples[j]) <= 0)
3497  break;
3498  memtuples[i] = memtuples[j];
3499  i = j;
3500  }
3501  memtuples[i] = *tuple;
3502 }
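/*
 * Worked example (editorial, not part of tuplesort.c): replacing the root
 * of the min-heap [1, 3, 2, 7, 4] with 6.  At the hole i = 0 the smaller
 * child is 2 (slot 2); 6 > 2, so 2 moves up and the hole becomes slot 2.
 * Slot 2 has no children within the five-entry heap, so 6 is stored
 * there, giving [2, 3, 6, 7, 4], which again satisfies the heap invariant.
 */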
3503 
3504 /*
3505  * Function to reverse the sort direction from its current state
3506  *
3507  * It is not safe to call this when performing hash tuplesorts
3508  */
3509 static void
3510 reversedirection(Tuplesortstate *state)
3511 {
3512  SortSupport sortKey = state->sortKeys;
3513  int nkey;
3514 
3515  for (nkey = 0; nkey < state->nKeys; nkey++, sortKey++)
3516  {
3517  sortKey->ssup_reverse = !sortKey->ssup_reverse;
3518  sortKey->ssup_nulls_first = !sortKey->ssup_nulls_first;
3519  }
3520 }
3521 
3522 
3523 /*
3524  * Tape interface routines
3525  */
3526 
3527 static unsigned int
3528 getlen(Tuplesortstate *state, int tapenum, bool eofOK)
3529 {
3530  unsigned int len;
3531 
3532  if (LogicalTapeRead(state->tapeset, tapenum,
3533  &len, sizeof(len)) != sizeof(len))
3534  elog(ERROR, "unexpected end of tape");
3535  if (len == 0 && !eofOK)
3536  elog(ERROR, "unexpected end of data");
3537  return len;
3538 }
3539 
3540 static void
3541 markrunend(Tuplesortstate *state, int tapenum)
3542 {
3543  unsigned int len = 0;
3544 
3545  LogicalTapeWrite(state->tapeset, tapenum, (void *) &len, sizeof(len));
3546 }
3547 
3548 /*
3549  * Get memory for tuple from within READTUP() routine.
3550  *
3551  * We use next free slot from the slab allocator, or palloc() if the tuple
3552  * is too large for that.
3553  */
3554 static void *
3555 readtup_alloc(Tuplesortstate *state, Size tuplen)
3556 {
3557  SlabSlot *buf;
3558 
3559  /*
3560  * We pre-allocate enough slots in the slab arena that we should never run
3561  * out.
3562  */
3563  Assert(state->slabFreeHead);
3564 
3565  if (tuplen > SLAB_SLOT_SIZE || !state->slabFreeHead)
3566  return MemoryContextAlloc(state->sortcontext, tuplen);
3567  else
3568  {
3569  buf = state->slabFreeHead;
3570  /* Reuse this slot */
3571  state->slabFreeHead = buf->nextfree;
3572 
3573  return buf;
3574  }
3575 }
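/*
 * Illustrative sketch (editorial, not part of tuplesort.c; the demo_*
 * names and the 1 kB slot size are assumptions, with malloc() standing in
 * for MemoryContextAlloc()): the slab free-list pattern used above.  An
 * arena is carved into fixed-size slots chained through their first
 * pointer; allocation pops the head of the list, and oversized requests
 * bypass the slab entirely.
 */
#include <stdlib.h>

#define DEMO_SLOT_SIZE 1024

union demo_slot
{
    union demo_slot *nextfree;
    char        buffer[DEMO_SLOT_SIZE];
};

static union demo_slot *demo_freehead; /* head of the free-slot list */

static void
demo_slab_init(union demo_slot *arena, size_t nslots)
{
    size_t      i;

    demo_freehead = NULL;
    for (i = 0; i < nslots; i++)
    {
        arena[i].nextfree = demo_freehead;
        demo_freehead = &arena[i];
    }
}

static void *
demo_slab_alloc(size_t len)
{
    union demo_slot *slot;

    if (len > DEMO_SLOT_SIZE || demo_freehead == NULL)
        return malloc(len);     /* too big (or slab exhausted): ordinary heap */

    slot = demo_freehead;
    demo_freehead = slot->nextfree; /* reuse this slot */
    return slot;
}

static void
demo_slab_free(void *p, size_t len)
{
    if (len > DEMO_SLOT_SIZE)
        free(p);
    else
    {
        union demo_slot *slot = (union demo_slot *) p;

        slot->nextfree = demo_freehead; /* push back onto the free list */
        demo_freehead = slot;
    }
}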
3576 
3577 
3578 /*
3579  * Routines specialized for HeapTuple (actually MinimalTuple) case
3580  */
3581 
3582 static int
3583 comparetup_heap(const SortTuple *a, const SortTuple *b, Tuplesortstate *state)
3584 {
3585  SortSupport sortKey = state->sortKeys;
3586  HeapTupleData ltup;
3587  HeapTupleData rtup;
3588  TupleDesc tupDesc;
3589  int nkey;
3590  int32 compare;
3591  AttrNumber attno;
3592  Datum datum1,
3593  datum2;
3594  bool isnull1,
3595  isnull2;
3596 
3597 
3598  /* Compare the leading sort key */
3599  compare = ApplySortComparator(a->datum1, a->isnull1,
3600  b->datum1, b->isnull1,
3601  sortKey);
3602  if (compare != 0)
3603  return compare;
3604 
3605  /* Compare additional sort keys */
3606  ltup.t_len = ((MinimalTuple) a->tuple)->t_len + MINIMAL_TUPLE_OFFSET;
3607  ltup.t_data = (HeapTupleHeader) ((char *) a->tuple - MINIMAL_TUPLE_OFFSET);
3608  rtup.t_len = ((MinimalTuple) b->tuple)->t_len + MINIMAL_TUPLE_OFFSET;
3609  rtup.t_data = (HeapTupleHeader) ((char *) b->tuple - MINIMAL_TUPLE_OFFSET);
3610  tupDesc = state->tupDesc;
3611 
3612  if (sortKey->abbrev_converter)
3613  {
3614  attno = sortKey->ssup_attno;
3615 
3616  datum1 = heap_getattr(&ltup, attno, tupDesc, &isnull1);
3617  datum2 = heap_getattr(&rtup, attno, tupDesc, &isnull2);
3618 
3619  compare = ApplySortAbbrevFullComparator(datum1, isnull1,
3620  datum2, isnull2,
3621  sortKey);
3622  if (compare != 0)
3623  return compare;
3624  }
3625 
3626  sortKey++;
3627  for (nkey = 1; nkey < state->nKeys; nkey++, sortKey++)
3628  {
3629  attno = sortKey->ssup_attno;
3630 
3631  datum1 = heap_getattr(&ltup, attno, tupDesc, &isnull1);
3632  datum2 = heap_getattr(&rtup, attno, tupDesc, &isnull2);
3633 
3634  compare = ApplySortComparator(datum1, isnull1,
3635  datum2, isnull2,
3636  sortKey);
3637  if (compare != 0)
3638  return compare;
3639  }
3640 
3641  return 0;
3642 }
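/*
 * Illustrative sketch (editorial, not part of tuplesort.c; the demo_*
 * names are assumptions and the "first byte" abbreviation is a stand-in
 * for a real abbrev_converter): the abbreviated-key pattern used by
 * comparetup_heap() above.  The cached abbreviation is compared first,
 * and only on a tie do we pay for the authoritative comparison, mirroring
 * ApplySortComparator() followed by ApplySortAbbrevFullComparator().
 */
#include <string.h>

struct demo_sorttuple
{
    unsigned char abbrev;       /* cheap, cache-resident abbreviation */
    const char *full;           /* the authoritative value */
};

static int
demo_compare(const struct demo_sorttuple *a, const struct demo_sorttuple *b)
{
    /* leading comparison on the abbreviation only */
    if (a->abbrev != b->abbrev)
        return (a->abbrev > b->abbrev) ? 1 : -1;

    /* abbreviations tied: resolve with the full value */
    return strcmp(a->full, b->full);
}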
3643 
3644 static void
3645 copytup_heap(Tuplesortstate *state, SortTuple *stup, void *tup)
3646 {
3647  /*
3648  * We expect the passed "tup" to be a TupleTableSlot, and form a
3649  * MinimalTuple using the exported interface for that.
3650  */
3651  TupleTableSlot *slot = (TupleTableSlot *) tup;
3652  Datum original;
3653  MinimalTuple tuple;
3654  HeapTupleData htup;
3655  MemoryContext oldcontext = MemoryContextSwitchTo(state->tuplecontext);
3656 
3657  /* copy the tuple into sort storage */
3658  tuple = ExecCopySlotMinimalTuple(slot);
3659  stup->tuple = (void *) tuple;
3660  USEMEM(state, GetMemoryChunkSpace(tuple));
3661  /* set up first-column key value */
3662  htup.t_len = tuple->t_len + MINIMAL_TUPLE_OFFSET;
3663  htup.t_data = (HeapTupleHeader) ((char *) tuple - MINIMAL_TUPLE_OFFSET);
3664  original = heap_getattr(&htup,
3665  state->sortKeys[0].ssup_attno,
3666  state->tupDesc,
3667  &stup->isnull1);
3668 
3669  MemoryContextSwitchTo(oldcontext);
3670 
3671  if (!state->sortKeys->abbrev_converter || stup->isnull1)
3672  {
3673  /*
3674  * Store ordinary Datum representation, or NULL value. If there is a
3675  * converter it won't expect NULL values, and cost model is not
3676  * required to account for NULL, so in that case we avoid calling
3677  * converter and just set datum1 to zeroed representation (to be
3678  * consistent, and to support cheap inequality tests for NULL
3679  * abbreviated keys).
3680  */
3681  stup->datum1 = original;
3682  }
3683  else if (!consider_abort_common(state))
3684  {
3685  /* Store abbreviated key representation */
3686  stup->datum1 = state->sortKeys->abbrev_converter(original,
3687  state->sortKeys);
3688  }
3689  else
3690  {
3691  /* Abort abbreviation */
3692  int i;
3693 
3694  stup->datum1 = original;
3695 
3696  /*
3697  * Set state to be consistent with never trying abbreviation.
3698  *
3699  * Alter datum1 representation in already-copied tuples, so as to
3700  * ensure a consistent representation (current tuple was just
3701  * handled). It does not matter if some dumped tuples are already
3702  * sorted on tape, since serialized tuples lack abbreviated keys
3703  * (TSS_BUILDRUNS state prevents control reaching here in any case).
3704  */
3705  for (i = 0; i < state->memtupcount; i++)
3706  {
3707  SortTuple *mtup = &state->memtuples[i];
3708 
3709  htup.t_len = ((MinimalTuple) mtup->tuple)->t_len +
3710  MINIMAL_TUPLE_OFFSET;
3711  htup.t_data = (HeapTupleHeader) ((char *) mtup->tuple -
3712  MINIMAL_TUPLE_OFFSET);
3713 
3714  mtup->datum1 = heap_getattr(&htup,
3715  state->sortKeys[0].ssup_attno,
3716  state->tupDesc,
3717  &mtup->isnull1);
3718  }
3719  }
3720 }
3721 
3722 static void
3723 writetup_heap(Tuplesortstate *state, int tapenum, SortTuple *stup)
3724 {
3725  MinimalTuple tuple = (MinimalTuple) stup->tuple;
3726 
3727  /* the part of the MinimalTuple we'll write: */
3728  char *tupbody = (char *) tuple + MINIMAL_TUPLE_DATA_OFFSET;
3729  unsigned int tupbodylen = tuple->t_len - MINIMAL_TUPLE_DATA_OFFSET;
3730 
3731  /* total on-disk footprint: */
3732  unsigned int tuplen = tupbodylen + sizeof(int);
3733 
3734  LogicalTapeWrite(state->tapeset, tapenum,
3735  (void *) &tuplen, sizeof(tuplen));
3736  LogicalTapeWrite(state->tapeset, tapenum,
3737  (void *) tupbody, tupbodylen);
3738  if (state->randomAccess) /* need trailing length word? */
3739  LogicalTapeWrite(state->tapeset, tapenum,
3740  (void *) &tuplen, sizeof(tuplen));
3741 
3742  if (!state->slabAllocatorUsed)
3743  {
3744  FREEMEM(state, GetMemoryChunkSpace(tuple));
3745  heap_free_minimal_tuple(tuple);
3746  }
3747 }
3748 
3749 static void
3750 readtup_heap(Tuplesortstate *state, SortTuple *stup,
3751  int tapenum, unsigned int len)
3752 {
3753  unsigned int tupbodylen = len - sizeof(int);
3754  unsigned int tuplen = tupbodylen + MINIMAL_TUPLE_DATA_OFFSET;
3755  MinimalTuple tuple = (MinimalTuple) readtup_alloc(state, tuplen);
3756  char *tupbody = (char *) tuple + MINIMAL_TUPLE_DATA_OFFSET;
3757  HeapTupleData htup;
3758 
3759  /* read in the tuple proper */
3760  tuple->t_len = tuplen;
3761  LogicalTapeReadExact(state->tapeset, tapenum,
3762  tupbody, tupbodylen);
3763  if (state->randomAccess) /* need trailing length word? */
3764  LogicalTapeReadExact(state->tapeset, tapenum,
3765  &tuplen, sizeof(tuplen));
3766  stup->tuple = (void *) tuple;
3767  /* set up first-column key value */
3768  htup.t_len = tuple->t_len + MINIMAL_TUPLE_OFFSET;
3769  htup.t_data = (HeapTupleHeader) ((char *) tuple - MINIMAL_TUPLE_OFFSET);
3770  stup->datum1 = heap_getattr(&htup,
3771  state->sortKeys[0].ssup_attno,
3772  state->tupDesc,
3773  &stup->isnull1);
3774 }
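/*
 * Illustrative sketch (editorial, not part of tuplesort.c; the demo_*
 * names are assumptions, with a plain FILE * standing in for the logical
 * tape): the record framing used by writetup_heap()/readtup_heap() above.
 * Each record is a leading length word (which counts itself), the tuple
 * body, and -- only when random access is requested -- a trailing copy of
 * the length word so the tape can also be read backwards.
 */
#include <stdio.h>

static void
demo_writetup(FILE *tape, const void *body, unsigned int bodylen,
              int random_access)
{
    unsigned int len = bodylen + (unsigned int) sizeof(unsigned int);

    fwrite(&len, sizeof(len), 1, tape);
    fwrite(body, 1, bodylen, tape);
    if (random_access)          /* trailing length word for backward reads */
        fwrite(&len, sizeof(len), 1, tape);
}

/* "len" is the leading length word already read by a getlen() analogue. */
static int
demo_readtup(FILE *tape, void *body, unsigned int len, int random_access)
{
    unsigned int bodylen = len - (unsigned int) sizeof(unsigned int);

    if (fread(body, 1, bodylen, tape) != bodylen)
        return -1;              /* unexpected end of tape */
    if (random_access && fread(&len, sizeof(len), 1, tape) != 1)
        return -1;              /* missing trailing length word */
    return (int) bodylen;
}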
3775 
3776 /*
3777  * Routines specialized for the CLUSTER case (HeapTuple data, with
3778  * comparisons per a btree index definition)
3779  */
3780 
3781 static int
3782 comparetup_cluster(const SortTuple *a, const SortTuple *b,
3783  Tuplesortstate *state)
3784 {
3785  SortSupport sortKey = state->sortKeys;
3786  HeapTuple ltup;
3787  HeapTuple rtup;
3788  TupleDesc tupDesc;
3789  int nkey;
3790  int32 compare;
3791  Datum datum1,
3792  datum2;
3793  bool isnull1,
3794  isnull2;
3795  AttrNumber leading = state->indexInfo->ii_KeyAttrNumbers[0];
3796 
3797  /* Be prepared to compare additional sort keys */
3798  ltup = (HeapTuple) a->tuple;
3799  rtup = (HeapTuple) b->tuple;
3800  tupDesc = state->tupDesc;
3801 
3802  /* Compare the leading sort key, if it's simple */
3803  if (leading != 0)
3804  {
3805  compare = ApplySortComparator(a->datum1, a->isnull1,
3806  b->datum1, b->isnull1,
3807  sortKey);
3808  if (compare != 0)
3809  return compare;
3810 
3811  if (sortKey->abbrev_converter)
3812  {
3813  datum1 = heap_getattr(ltup, leading, tupDesc, &isnull1);
3814  datum2 = heap_getattr(rtup, leading, tupDesc, &isnull2);
3815 
3816  compare = ApplySortAbbrevFullComparator(datum1, isnull1,
3817  datum2, isnull2,
3818  sortKey);
3819  }
3820  if (compare != 0 || state->nKeys == 1)
3821  return compare;
3822  /* Compare additional columns the hard way */
3823  sortKey++;
3824  nkey = 1;
3825  }
3826  else
3827  {
3828  /* Must compare all keys the hard way */
3829  nkey = 0;
3830  }
3831 
3832  if (state->indexInfo->ii_Expressions == NULL)
3833  {
3834  /* If not expression index, just compare the proper heap attrs */
3835 
3836  for (; nkey < state->nKeys; nkey++, sortKey++)
3837  {
3838  AttrNumber attno = state->indexInfo->ii_KeyAttrNumbers[nkey];
3839 
3840  datum1 = heap_getattr(ltup, attno, tupDesc, &isnull1);
3841  datum2 = heap_getattr(rtup, attno, tupDesc, &isnull2);
3842 
3843  compare = ApplySortComparator(datum1, isnull1,
3844  datum2, isnull2,
3845  sortKey);
3846  if (compare != 0)
3847  return compare;
3848  }
3849  }
3850  else
3851  {
3852  /*
3853  * In the expression index case, compute the whole index tuple and
3854  * then compare values. It would perhaps be faster to compute only as
3855  * many columns as we need to compare, but that would require
3856  * duplicating all the logic in FormIndexDatum.
3857  */
3858  Datum l_index_values[INDEX_MAX_KEYS];
3859  bool l_index_isnull[INDEX_MAX_KEYS];
3860  Datum r_index_values[INDEX_MAX_KEYS];
3861  bool r_index_isnull[INDEX_MAX_KEYS];
3862  TupleTableSlot *ecxt_scantuple;
3863 
3864  /* Reset context each time to prevent memory leakage */
3865  ResetPerTupleExprContext(state->estate);
3866 
3867  ecxt_scantuple = GetPerTupleExprContext(state->estate)->ecxt_scantuple;
3868 
3869  ExecStoreTuple(ltup, ecxt_scantuple, InvalidBuffer, false);
3870  FormIndexDatum(state->indexInfo, ecxt_scantuple, state->estate,
3871  l_index_values, l_index_isnull);
3872 
3873  ExecStoreTuple(rtup, ecxt_scantuple, InvalidBuffer, false);
3874  FormIndexDatum(state->indexInfo, ecxt_scantuple, state->estate,
3875  r_index_values, r_index_isnull);
3876 
3877  for (; nkey < state->nKeys; nkey++, sortKey++)
3878  {
3879  compare = ApplySortComparator(l_index_values[nkey],
3880  l_index_isnull[nkey],
3881  r_index_values[nkey],
3882  r_index_isnull[nkey],
3883  sortKey);
3884  if (compare != 0)
3885  return compare;
3886  }
3887  }
3888 
3889  return 0;
3890 }
3891 
3892 static void
3893 copytup_cluster(Tuplesortstate *state, SortTuple *stup, void *tup)
3894 {
3895  HeapTuple tuple = (HeapTuple) tup;
3896  Datum original;
3897  MemoryContext oldcontext = MemoryContextSwitchTo(state->tuplecontext);
3898 
3899  /* copy the tuple into sort storage */
3900  tuple = heap_copytuple(tuple);
3901  stup->tuple = (void *) tuple;
3902  USEMEM(state, GetMemoryChunkSpace(tuple));
3903 
3904  MemoryContextSwitchTo(oldcontext);
3905 
3906  /*
3907  * set up first-column key value, and potentially abbreviate, if it's a
3908  * simple column
3909  */
3910  if (state->indexInfo->ii_KeyAttrNumbers[0] == 0)
3911  return;
3912 
3913  original = heap_getattr(tuple,
3914  state->indexInfo->ii_KeyAttrNumbers[0],
3915  state->tupDesc,
3916  &stup->isnull1);
3917 
3918  if (!state->sortKeys->abbrev_converter || stup->isnull1)
3919  {
3920  /*
3921  * Store ordinary Datum representation, or NULL value. If there is a
3922  * converter it won't expect NULL values, and cost model is not
3923  * required to account for NULL, so in that case we avoid calling
3924  * converter and just set datum1 to zeroed representation (to be
3925  * consistent, and to support cheap inequality tests for NULL
3926  * abbreviated keys).
3927  */
3928  stup->datum1 = original;
3929  }
3930  else if (!consider_abort_common(state))
3931  {
3932  /* Store abbreviated key representation */
3933  stup->datum1 = state->sortKeys->abbrev_converter(original,
3934  state->sortKeys);
3935  }
3936  else
3937  {
3938  /* Abort abbreviation */
3939  int i;
3940 
3941  stup->datum1 = original;
3942 
3943  /*
3944  * Set state to be consistent with never trying abbreviation.
3945  *
3946  * Alter datum1 representation in already-copied tuples, so as to
3947  * ensure a consistent representation (current tuple was just
3948  * handled). It does not matter if some dumped tuples are already
3949  * sorted on tape, since serialized tuples lack abbreviated keys
3950  * (TSS_BUILDRUNS state prevents control reaching here in any case).
3951  */
3952  for (i = 0; i < state->memtupcount; i++)
3953  {
3954  SortTuple *mtup = &state->memtuples[i];
3955 
3956  tuple = (HeapTuple) mtup->tuple;
3957  mtup->datum1 = heap_getattr(tuple,
3958  state->indexInfo->ii_KeyAttrNumbers[0],
3959  state->tupDesc,
3960  &mtup->isnull1);
3961  }
3962  }
3963 }
3964 
3965 static void
3966 writetup_cluster(Tuplesortstate *state, int tapenum, SortTuple *stup)
3967 {
3968  HeapTuple tuple = (HeapTuple) stup->tuple;
3969  unsigned int tuplen = tuple->t_len + sizeof(ItemPointerData) + sizeof(int);
3970 
3971  /* We need to store t_self, but not other fields of HeapTupleData */
3972  LogicalTapeWrite(state->tapeset, tapenum,
3973  &tuplen, sizeof(tuplen));
3974  LogicalTapeWrite(state->tapeset, tapenum,
3975  &tuple->t_self, sizeof(ItemPointerData));
3976  LogicalTapeWrite(state->tapeset, tapenum,
3977  tuple->t_data, tuple->t_len);
3978  if (state->randomAccess) /* need trailing length word? */
3979  LogicalTapeWrite(state->tapeset, tapenum,
3980  &tuplen, sizeof(tuplen));
3981 
3982  if (!state->slabAllocatorUsed)
3983  {
3984  FREEMEM(state, GetMemoryChunkSpace(tuple));
3985  heap_freetuple(tuple);
3986  }
3987 }
3988 
3989 static void
3990 readtup_cluster(Tuplesortstate *state, SortTuple *stup,
3991  int tapenum, unsigned int tuplen)
3992 {
3993  unsigned int t_len = tuplen - sizeof(ItemPointerData) - sizeof(int);
3994  HeapTuple tuple = (HeapTuple) readtup_alloc(state,
3995  t_len + HEAPTUPLESIZE);
3996 
3997  /* Reconstruct the HeapTupleData header */
3998  tuple->t_data = (HeapTupleHeader) ((char *) tuple + HEAPTUPLESIZE);
3999  tuple->t_len = t_len;
4000  LogicalTapeReadExact(state->tapeset, tapenum,
4001  &tuple->t_self, sizeof(ItemPointerData));
4002  /* We don't currently bother to reconstruct t_tableOid */
4003  tuple->t_tableOid = InvalidOid;
4004  /* Read in the tuple body */
4005  LogicalTapeReadExact(state->tapeset, tapenum,
4006  tuple->t_data, tuple->t_len);
4007  if (state->randomAccess) /* need trailing length word? */
4008  LogicalTapeReadExact(state->tapeset, tapenum,
4009  &tuplen, sizeof(tuplen));
4010  stup->tuple = (void *) tuple;
4011  /* set up first-column key value, if it's a simple column */
4012  if (state->indexInfo->ii_KeyAttrNumbers[0] != 0)
4013  stup->datum1 = heap_getattr(tuple,
4014  state->indexInfo->ii_KeyAttrNumbers[0],
4015  state->tupDesc,
4016  &stup->isnull1);
4017 }
4018 
4019 /*
4020  * Routines specialized for IndexTuple case
4021  *
4022  * The btree and hash cases require separate comparison functions, but the
4023  * IndexTuple representation is the same so the copy/write/read support
4024  * functions can be shared.
4025  */
4026 
4027 static int
4028 comparetup_index_btree(const SortTuple *a, const SortTuple *b,
4029  Tuplesortstate *state)
4030 {
4031  /*
4032  * This is similar to comparetup_heap(), but expects index tuples. There
4033  * is also special handling for enforcing uniqueness, and special
4034  * treatment for equal keys at the end.
4035  */
4036  SortSupport sortKey = state->sortKeys;
4037  IndexTuple tuple1;
4038  IndexTuple tuple2;
4039  int keysz;
4040  TupleDesc tupDes;
4041  bool equal_hasnull = false;
4042  int nkey;
4043  int32 compare;
4044  Datum datum1,
4045  datum2;
4046  bool isnull1,
4047  isnull2;
4048 
4049 
4050  /* Compare the leading sort key */
4051  compare = ApplySortComparator(a->datum1, a->isnull1,
4052  b->datum1, b->isnull1,
4053  sortKey);
4054  if (compare != 0)
4055  return compare;
4056 
4057  /* Compare additional sort keys */
4058  tuple1 = (IndexTuple) a->tuple;
4059  tuple2 = (IndexTuple) b->tuple;
4060  keysz = state->nKeys;
4061  tupDes = RelationGetDescr(state->indexRel);
4062 
4063  if (sortKey->abbrev_converter)
4064  {
4065  datum1 = index_getattr(tuple1, 1, tupDes, &isnull1);
4066  datum2 = index_getattr(tuple2, 1, tupDes, &isnull2);
4067 
4068  compare = ApplySortAbbrevFullComparator(datum1, isnull1,
4069  datum2, isnull2,
4070  sortKey);
4071  if (compare != 0)
4072  return compare;
4073  }
4074 
4075  /* they are equal, so we only need to examine one null flag */
4076  if (a->isnull1)
4077  equal_hasnull = true;
4078 
4079  sortKey++;
4080  for (nkey = 2; nkey <= keysz; nkey++, sortKey++)
4081  {
4082  datum1 = index_getattr(tuple1, nkey, tupDes, &isnull1);
4083  datum2 = index_getattr(tuple2, nkey, tupDes, &isnull2);
4084 
4085  compare = ApplySortComparator(datum1, isnull1,
4086  datum2, isnull2,
4087  sortKey);
4088  if (compare != 0)
4089  return compare; /* done when we find unequal attributes */
4090 
4091  /* they are equal, so we only need to examine one null flag */
4092  if (isnull1)
4093  equal_hasnull = true;
4094  }
4095 
4096  /*
4097  * If btree has asked us to enforce uniqueness, complain if two equal
4098  * tuples are detected (unless there was at least one NULL field).
4099  *
4100  * It is sufficient to make the test here, because if two tuples are equal
4101  * they *must* get compared at some stage of the sort --- otherwise the
4102  * sort algorithm wouldn't have checked whether one must appear before the
4103  * other.
4104  */
4105  if (state->enforceUnique && !equal_hasnull)
4106  {
4107  Datum values[INDEX_MAX_KEYS];
4108  bool isnull[INDEX_MAX_KEYS];
4109  char *key_desc;
4110 
4111  /*
4112  * Some rather brain-dead implementations of qsort (such as the one in
4113  * QNX 4) will sometimes call the comparison routine to compare a
4114  * value to itself, but we always use our own implementation, which
4115  * does not.
4116  */
4117  Assert(tuple1 != tuple2);
4118 
4119  index_deform_tuple(tuple1, tupDes, values, isnull);
4120 
4121  key_desc = BuildIndexValueDescription(state->indexRel, values, isnull);
4122 
4123  ereport(ERROR,
4124  (errcode(ERRCODE_UNIQUE_VIOLATION),
4125  errmsg("could not create unique index \"%s\"",
4126  RelationGetRelationName(state->indexRel)),
4127  key_desc ? errdetail("Key %s is duplicated.", key_desc) :
4128  errdetail("Duplicate keys exist."),
4129  errtableconstraint(state->heapRel,
4130  RelationGetRelationName(state->indexRel))));
4131  }
4132 
4133  /*
4134  * If key values are equal, we sort on ItemPointer. This does not affect
4135  * validity of the finished index, but it may be useful to have index
4136  * scans in physical order.
4137  */
4138  {
4139  BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
4140  BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
4141 
4142  if (blk1 != blk2)
4143  return (blk1 < blk2) ? -1 : 1;
4144  }
4145  {
4146  OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
4147  OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
4148 
4149  if (pos1 != pos2)
4150  return (pos1 < pos2) ? -1 : 1;
4151  }
4152 
4153  return 0;
4154 }
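/*
 * Illustrative sketch (editorial, not part of tuplesort.c; the demo_*
 * name is an assumption): the final tie-break in comparetup_index_btree()
 * above is just a lexicographic comparison of (block number, offset
 * number), i.e. the tuples' physical heap order.
 */
static int
demo_itempointer_compare(unsigned int blk1, unsigned int pos1,
                         unsigned int blk2, unsigned int pos2)
{
    if (blk1 != blk2)
        return (blk1 < blk2) ? -1 : 1;
    if (pos1 != pos2)
        return (pos1 < pos2) ? -1 : 1;
    return 0;
}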
4155 
4156 static int
4157 comparetup_index_hash(const SortTuple *a, const SortTuple *b,
4158  Tuplesortstate *state)
4159 {
4160  uint32 hash1;
4161  uint32 hash2;
4162  IndexTuple tuple1;
4163  IndexTuple tuple2;
4164 
4165  /*
4166  * Fetch hash keys and mask off bits we don't want to sort by. We know
4167  * that the first column of the index tuple is the hash key.
4168  */
4169  Assert(!a->isnull1);
4170  hash1 = DatumGetUInt32(a->datum1) & state->hash_mask;
4171  Assert(!b->isnull1);
4172  hash2 = DatumGetUInt32(b->datum1) & state->hash_mask;
4173 
4174  if (hash1 > hash2)
4175  return 1;
4176  else if (hash1 < hash2)
4177  return -1;
4178 
4179  /*
4180  * If hash values are equal, we sort on ItemPointer. This does not affect
4181  * validity of the finished index, but it may be useful to have index
4182  * scans in physical order.
4183  */
4184  tuple1 = (IndexTuple) a->tuple;
4185  tuple2 = (IndexTuple) b->tuple;
4186 
4187  {
4188  BlockNumber blk1 = ItemPointerGetBlockNumber(&tuple1->t_tid);
4189  BlockNumber blk2 = ItemPointerGetBlockNumber(&tuple2->t_tid);
4190 
4191  if (blk1 != blk2)
4192  return (blk1 < blk2) ? -1 : 1;
4193  }
4194  {
4195  OffsetNumber pos1 = ItemPointerGetOffsetNumber(&tuple1->t_tid);
4196  OffsetNumber pos2 = ItemPointerGetOffsetNumber(&tuple2->t_tid);
4197 
4198  if (pos1 != pos2)
4199  return (pos1 < pos2) ? -1 : 1;
4200  }
4201 
4202  return 0;
4203 }
4204 
4205 static void
4206 copytup_index(Tuplesortstate *state, SortTuple *stup, void *tup)
4207 {
4208  IndexTuple tuple = (IndexTuple) tup;
4209  unsigned int tuplen = IndexTupleSize(tuple);
4210  IndexTuple newtuple;
4211  Datum original;
4212 
4213  /* copy the tuple into sort storage */
4214  newtuple = (IndexTuple) MemoryContextAlloc(state->tuplecontext, tuplen);
4215  memcpy(newtuple, tuple, tuplen);
4216  USEMEM(state, GetMemoryChunkSpace(newtuple));
4217  stup->tuple = (void *) newtuple;
4218  /* set up first-column key value */
4219  original = index_getattr(newtuple,
4220  1,
4221  RelationGetDescr(state->indexRel),
4222  &stup->isnull1);
4223 
4224  if (!state->sortKeys->abbrev_converter || stup->isnull1)
4225  {
4226  /*
4227  * Store ordinary Datum representation, or NULL value. If there is a
4228  * converter it won't expect NULL values, and cost model is not
4229  * required to account for NULL, so in that case we avoid calling
4230  * converter and just set datum1 to zeroed representation (to be
4231  * consistent, and to support cheap inequality tests for NULL
4232  * abbreviated keys).
4233  */
4234  stup->datum1 = original;
4235  }
4236  else if (!consider_abort_common(state))
4237  {
4238  /* Store abbreviated key representation */
4239  stup->datum1 = state->sortKeys->abbrev_converter(original,
4240  state->sortKeys);
4241  }
4242  else
4243  {
4244  /* Abort abbreviation */
4245  int i;
4246 
4247  stup->datum1 = original;
4248 
4249  /*
4250  * Set state to be consistent with never trying abbreviation.
4251  *
4252  * Alter datum1 representation in already-copied tuples, so as to
4253  * ensure a consistent representation (current tuple was just
4254  * handled). It does not matter if some dumped tuples are already
4255  * sorted on tape, since serialized tuples lack abbreviated keys
4256  * (TSS_BUILDRUNS state prevents control reaching here in any case).
4257  */
4258  for (i = 0; i < state->memtupcount; i++)
4259  {
4260  SortTuple *mtup = &state->memtuples[i];
4261 
4262  tuple = (IndexTuple) mtup->tuple;
4263  mtup->datum1 = index_getattr(tuple,
4264  1,
4265  RelationGetDescr(state->indexRel),
4266  &mtup->isnull1);
4267  }
4268  }
4269 }
4270 
4271 static void
4272 writetup_index(Tuplesortstate *state, int tapenum, SortTuple *stup)
4273 {
4274  IndexTuple tuple = (IndexTuple) stup->tuple;
4275  unsigned int tuplen;
4276 
4277  tuplen = IndexTupleSize(tuple) + sizeof(tuplen);
4278  LogicalTapeWrite(state->tapeset, tapenum,
4279  (void *) &tuplen, sizeof(tuplen));
4280  LogicalTapeWrite(state->tapeset, tapenum,
4281  (void *) tuple, IndexTupleSize(tuple));
4282  if (state->randomAccess) /* need trailing length word? */
4283  LogicalTapeWrite(state->tapeset, tapenum,
4284  (void *) &tuplen, sizeof(tuplen));
4285 
4286  if (!state->slabAllocatorUsed)
4287  {
4288  FREEMEM(state, GetMemoryChunkSpace(tuple));
4289  pfree(tuple);
4290  }
4291 }
4292 
4293 static void
4294 readtup_index(Tuplesortstate *state, SortTuple *stup,
4295  int tapenum, unsigned int len)
4296 {
4297  unsigned int tuplen = len - sizeof(unsigned int);
4298  IndexTuple tuple = (IndexTuple) readtup_alloc(state, tuplen);
4299 
4300  LogicalTapeReadExact(state->tapeset, tapenum,
4301  tuple, tuplen);
4302  if (state->randomAccess) /* need trailing length word? */
4303  LogicalTapeReadExact(state->tapeset, tapenum,
4304  &tuplen, sizeof(tuplen));
4305  stup->tuple = (void *) tuple;
4306  /* set up first-column key value */
4307  stup->datum1 = index_getattr(tuple,
4308  1,
4309  RelationGetDescr(state->indexRel),
4310  &stup->isnull1);
4311 }
4312 
4313 /*
4314  * Routines specialized for DatumTuple case
4315  */
4316 
4317 static int
4318 comparetup_datum(const SortTuple *a, const SortTuple *b, Tuplesortstate *state)
4319 {
4320  int compare;
4321 
4322  compare = ApplySortComparator(a->datum1, a->isnull1,
4323  b->datum1, b->isnull1,
4324  state->sortKeys);
4325  if (compare != 0)
4326  return compare;
4327 
4328  /* if we have abbreviations, then "tuple" has the original value */
4329 
4330  if (state->sortKeys->abbrev_converter)
4331  compare = ApplySortAbbrevFullComparator(PointerGetDatum(a->tuple), a->isnull1,
4332  PointerGetDatum(b->tuple), b->isnull1,
4333  state->sortKeys);
4334 
4335  return compare;
4336 }
4337 
4338 static void
4339 copytup_datum(Tuplesortstate *state, SortTuple *stup, void *tup)
4340 {
4341  /* Not currently needed */
4342  elog(ERROR, "copytup_datum() should not be called");
4343 }
4344 
4345 static void
4346 writetup_datum(Tuplesortstate *state, int tapenum, SortTuple *stup)
4347 {
4348  void *waddr;
4349  unsigned int tuplen;
4350  unsigned int writtenlen;
4351 
4352  if (stup->isnull1)
4353  {
4354  waddr = NULL;
4355  tuplen = 0;
4356  }
4357  else if (!state->tuples)
4358  {
4359  waddr = &stup->datum1;
4360  tuplen = sizeof(Datum);
4361  }
4362  else
4363  {
4364  waddr = stup->tuple;
4365  tuplen = datumGetSize(PointerGetDatum(stup->tuple), false, state->datumTypeLen);
4366  Assert(tuplen != 0);
4367  }
4368 
4369  writtenlen = tuplen + sizeof(unsigned int);
4370 
4371  LogicalTapeWrite(state->tapeset, tapenum,
4372  (void *) &writtenlen, sizeof(writtenlen));
4373  LogicalTapeWrite(state->tapeset, tapenum,
4374  waddr, tuplen);
4375  if (state->randomAccess) /* need trailing length word? */
4376  LogicalTapeWrite(state->tapeset, tapenum,
4377  (void *) &writtenlen, sizeof(writtenlen));
4378 
4379  if (!state->slabAllocatorUsed && stup->tuple)
4380  {
4381  FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
4382  pfree(stup->tuple);
4383  }
4384 }
4385 
4386 static void
4387 readtup_datum(Tuplesortstate *state, SortTuple *stup,
4388  int tapenum, unsigned int len)
4389 {
4390  unsigned int tuplen = len - sizeof(unsigned int);
4391 
4392  if (tuplen == 0)
4393  {
4394  /* it's NULL */
4395  stup->datum1 = (Datum) 0;
4396  stup->isnull1 = true;
4397  stup->tuple = NULL;
4398  }
4399  else if (!state->tuples)
4400  {
4401  Assert(tuplen == sizeof(Datum));
4402  LogicalTapeReadExact(state->tapeset, tapenum,
4403  &stup->datum1, tuplen);
4404  stup->isnull1 = false;
4405  stup->tuple = NULL;
4406  }
4407  else
4408  {
4409  void *raddr = readtup_alloc(state, tuplen);
4410 
4411  LogicalTapeReadExact(state->tapeset, tapenum,
4412  raddr, tuplen);
4413  stup->datum1 = PointerGetDatum(raddr);
4414  stup->isnull1 = false;
4415  stup->tuple = raddr;
4416  }
4417 
4418  if (state->randomAccess) /* need trailing length word? */
4419  LogicalTapeReadExact(state->tapeset, tapenum,
4420  &tuplen, sizeof(tuplen));
4421 }
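/*
 * Worked example (editorial, not part of tuplesort.c): writetup_datum()
 * and readtup_datum() above distinguish three record shapes by the stored
 * body length alone.  A NULL is written as just a 4-byte length word with
 * value 4 (body length 0); a pass-by-value datum carries sizeof(Datum)
 * body bytes (8 on a typical 64-bit build, so a length word of 12); a
 * pass-by-reference datum carries datumGetSize() body bytes, which
 * readtup_datum() reloads into a slab slot or palloc'd chunk via
 * readtup_alloc().
 */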
4422 
4423 /*
4424  * Convenience routine to free a tuple previously loaded into sort memory
4425  */
4426 static void
4427 free_sort_tuple(Tuplesortstate *state, SortTuple *stup)
4428 {
4429  FREEMEM(state, GetMemoryChunkSpace(stup->tuple));
4430  pfree(stup->tuple);
4431 }