PostgreSQL Source Code git master
All Data Structures Namespaces Files Functions Variables Typedefs Enumerations Enumerator Macros Pages
trgm_regexp.c
Go to the documentation of this file.
1/*-------------------------------------------------------------------------
2 *
3 * trgm_regexp.c
4 * Regular expression matching using trigrams.
5 *
6 * The general idea of trigram index support for a regular expression (regex)
7 * search is to transform the regex into a logical expression on trigrams.
8 * For example:
9 *
10 * (ab|cd)efg => ((abe & bef) | (cde & def)) & efg
11 *
12 * If a string matches the regex, then it must match the logical expression on
13 * trigrams. The opposite is not necessarily true, however: a string that
14 * matches the logical expression might not match the original regex. Such
15 * false positives are removed via recheck, by running the regular regex match
16 * operator on the retrieved heap tuple.
17 *
18 * Since the trigram expression involves both AND and OR operators, we can't
19 * expect the core index machinery to evaluate it completely. Instead, the
20 * result of regex analysis is a list of trigrams to be sought in the index,
21 * plus a simplified graph that is used by trigramsMatchGraph() to determine
22 * whether a particular indexed value matches the expression.
23 *
24 * Converting a regex to a trigram expression is based on analysis of an
25 * automaton corresponding to the regex. The algorithm consists of four
26 * stages:
27 *
28 * 1) Compile the regexp to NFA form. This is handled by the PostgreSQL
29 * regexp library, which provides accessors for its opaque regex_t struct
30 * to expose the NFA state graph and the "colors" (sets of equivalent
31 * characters) used as state transition labels.
32 *
33 * 2) Transform the original NFA into an expanded graph, where arcs
34 * are labeled with trigrams that must be present in order to move from
35 * one state to another via the arcs. The trigrams used in this stage
36 * consist of colors, not characters, as in the original NFA.
37 *
38 * 3) Expand the color trigrams into regular trigrams consisting of
39 * characters. If too many distinct trigrams are produced, trigrams are
40 * eliminated and the graph is simplified until it's simple enough.
41 *
42 * 4) Finally, the resulting graph is packed into a TrgmPackedGraph struct,
43 * and returned to the caller.
44 *
45 * 1) Compile the regexp to NFA form
46 * ---------------------------------
47 * The automaton returned by the regexp compiler is a graph where vertices
48 * are "states" and arcs are labeled with colors. Each color represents
49 * a set of characters, so that all characters assigned to the same color
50 * are interchangeable, so far as matching the regexp is concerned. There
51 * are two special states: "initial" and "final". A state can have multiple
52 * outgoing arcs labeled with the same color, which makes the automaton
53 * non-deterministic, because it can be in many states simultaneously.
54 *
55 * Note that this NFA is already lossy compared to the original regexp,
56 * since it ignores some regex features such as lookahead constraints and
57 * backref matching. This is OK for our purposes since it's still the case
58 * that only strings matching the NFA can possibly satisfy the regexp.
59 *
60 * 2) Transform the original NFA into an expanded graph
61 * ----------------------------------------------------
62 * In the 2nd stage, the automaton is transformed into a graph based on the
63 * original NFA. Each state in the expanded graph represents a state from
64 * the original NFA, plus a prefix identifying the last two characters
65 * (colors, to be precise) seen before entering the state. There can be
66 * multiple states in the expanded graph for each state in the original NFA,
67 * depending on what characters can precede it. A prefix position can be
68 * "unknown" if it's uncertain what the preceding character was, or "blank"
69 * if the character was a non-word character (we don't need to distinguish
70 * which non-word character it was, so just think of all of them as blanks).
71 *
72 * For convenience in description, call an expanded-state identifier
73 * (two prefix colors plus a state number from the original NFA) an
74 * "enter key".
75 *
76 * Each arc of the expanded graph is labeled with a trigram that must be
77 * present in the string to match. We can construct this from an out-arc of
78 * the underlying NFA state by combining the expanded state's prefix with the
79 * color label of the underlying out-arc, if neither prefix position is
80 * "unknown". But note that some of the colors in the trigram might be
81 * "blank". This is OK since we want to generate word-boundary trigrams as
82 * the regular trigram machinery would, if we know that some word characters
83 * must be adjacent to a word boundary in all strings matching the NFA.
84 *
85 * The expanded graph can also have fewer states than the original NFA,
86 * because we don't bother to make a separate state entry unless the state
87 * is reachable by a valid arc. When an enter key is reachable from a state
88 * of the expanded graph, but we do not know a complete trigram associated
89 * with that transition, we cannot make a valid arc; instead we insert the
90 * enter key into the enterKeys list of the source state. This effectively
91 * means that the two expanded states are not reliably distinguishable based
92 * on examining trigrams.
93 *
94 * So the expanded graph resembles the original NFA, but the arcs are
95 * labeled with trigrams instead of individual characters, and there may be
96 * more or fewer states. It is a lossy representation of the original NFA:
97 * any string that matches the original regexp must match the expanded graph,
98 * but the reverse is not true.
99 *
100 * We build the expanded graph through a breadth-first traversal of states
101 * reachable from the initial state. At each reachable state, we identify the
102 * states reachable from it without traversing a predictable trigram, and add
103 * those states' enter keys to the current state. Then we generate all
104 * out-arcs leading out of this collection of states that have predictable
105 * trigrams, adding their target states to the queue of states to examine.
106 *
107 * When building the graph, if the number of states or arcs exceed pre-defined
108 * limits, we give up and simply mark any states not yet processed as final
109 * states. Roughly speaking, that means that we make use of some portion from
110 * the beginning of the regexp. Also, any colors that have too many member
111 * characters are treated as "unknown", so that we can't derive trigrams
112 * from them.
113 *
114 * 3) Expand the color trigrams into regular trigrams
115 * --------------------------------------------------
116 * The trigrams in the expanded graph are "color trigrams", consisting
117 * of three consecutive colors that must be present in the string. But for
118 * search, we need regular trigrams consisting of characters. In the 3rd
119 * stage, the color trigrams are expanded into regular trigrams. Since each
120 * color can represent many characters, the total number of regular trigrams
121 * after expansion could be very large. Because searching the index for
122 * thousands of trigrams would be slow, and would likely produce so many
123 * false positives that we would have to traverse a large fraction of the
124 * index, the graph is simplified further in a lossy fashion by removing
125 * color trigrams. When a color trigram is removed, the states connected by
126 * any arcs labeled with that trigram are merged.
127 *
128 * Trigrams do not all have equivalent value for searching: some of them are
129 * more frequent and some of them are less frequent. Ideally, we would like
130 * to know the distribution of trigrams, but we don't. But because of padding
131 * we know for sure that the empty character is more frequent than others,
132 * so we can penalize trigrams according to presence of whitespace. The
133 * penalty assigned to each color trigram is the number of simple trigrams
134 * it would produce, times the penalties[] multiplier associated with its
135 * whitespace content. (The penalties[] constants were calculated by analysis
136 * of some real-life text.) We eliminate color trigrams starting with the
137 * highest-penalty one, until we get to a total penalty of no more than
138 * WISH_TRGM_PENALTY. However, we cannot remove a color trigram if that would
139 * lead to merging the initial and final states, so we may not be able to
140 * reach WISH_TRGM_PENALTY. It's still okay so long as we have no more than
141 * MAX_TRGM_COUNT simple trigrams in total, otherwise we fail.
142 *
143 * 4) Pack the graph into a compact representation
144 * -----------------------------------------------
145 * The 2nd and 3rd stages might have eliminated or merged many of the states
146 * and trigrams created earlier, so in this final stage, the graph is
147 * compacted and packed into a simpler struct that contains only the
148 * information needed to evaluate it.
149 *
150 * ALGORITHM EXAMPLE:
151 *
152 * Consider the example regex "ab[cd]". This regex is transformed into the
153 * following NFA (for simplicity we show colors as their single members):
154 *
155 * 4#
156 * c/
157 * a b /
158 * 1* --- 2 ---- 3
159 * \
160 * d\
161 * 5#
162 *
163 * We use * to mark initial state and # to mark final state. It's not depicted,
164 * but states 1, 4, 5 have self-referencing arcs for all possible characters,
165 * because this pattern can match to any part of a string.
166 *
167 * As the result of stage 2 we will have the following graph:
168 *
169 * abc abd
170 * 2# <---- 1* ----> 3#
171 *
172 * The process for generating this graph is:
173 * 1) Create state 1 with enter key (UNKNOWN, UNKNOWN, 1).
174 * 2) Add key (UNKNOWN, "a", 2) to state 1.
175 * 3) Add key ("a", "b", 3) to state 1.
176 * 4) Create new state 2 with enter key ("b", "c", 4). Add an arc
177 * from state 1 to state 2 with label trigram "abc".
178 * 5) Mark state 2 final because state 4 of source NFA is marked as final.
179 * 6) Create new state 3 with enter key ("b", "d", 5). Add an arc
180 * from state 1 to state 3 with label trigram "abd".
181 * 7) Mark state 3 final because state 5 of source NFA is marked as final.
182 *
183 *
184 * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
185 * Portions Copyright (c) 1994, Regents of the University of California
186 *
187 * IDENTIFICATION
188 * contrib/pg_trgm/trgm_regexp.c
189 *
190 *-------------------------------------------------------------------------
191 */
192#include "postgres.h"
193
194#include "catalog/pg_collation_d.h"
195#include "regex/regexport.h"
196#include "trgm.h"
197#include "tsearch/ts_locale.h"
198#include "utils/formatting.h"
199#include "utils/hsearch.h"
200#include "utils/memutils.h"
201#include "varatt.h"
202
203/*
204 * Uncomment (or use -DTRGM_REGEXP_DEBUG) to print debug info,
205 * for exploring and debugging the algorithm implementation.
206 * This produces three graph files in /tmp, in Graphviz .gv format.
207 * Some progress information is also printed to postmaster stderr.
208 */
209/* #define TRGM_REGEXP_DEBUG */
210
211/*
212 * These parameters are used to limit the amount of work done.
213 * Otherwise regex processing could be too slow and memory-consuming.
214 *
215 * MAX_EXPANDED_STATES - How many states we allow in expanded graph
216 * MAX_EXPANDED_ARCS - How many arcs we allow in expanded graph
217 * MAX_TRGM_COUNT - How many simple trigrams we allow to be extracted
218 * WISH_TRGM_PENALTY - Maximum desired sum of color trigram penalties
219 * COLOR_COUNT_LIMIT - Maximum number of characters per color
220 */
221#define MAX_EXPANDED_STATES 128
222#define MAX_EXPANDED_ARCS 1024
223#define MAX_TRGM_COUNT 256
224#define WISH_TRGM_PENALTY 16
225#define COLOR_COUNT_LIMIT 256
226
227/*
228 * Penalty multipliers for trigram counts depending on whitespace contents.
229 * Numbers based on analysis of real-life texts.
230 */
231static const float4 penalties[8] = {
232 1.0f, /* "aaa" */
233 3.5f, /* "aa " */
234 0.0f, /* "a a" (impossible) */
235 0.0f, /* "a " (impossible) */
236 4.2f, /* " aa" */
237 2.1f, /* " a " */
238 25.0f, /* " a" */
239 0.0f /* " " (impossible) */
240};
241
242/* Struct representing a single pg_wchar, converted back to multibyte form */
243typedef struct
244{
247
248/*
249 * Attributes of NFA colors:
250 *
251 * expandable - we know the character expansion of this color
252 * containsNonWord - color contains non-word characters
253 * (which will not be extracted into trigrams)
254 * wordCharsCount - count of word characters in color
255 * wordChars - array of this color's word characters
256 * (which can be extracted into trigrams)
257 *
258 * When expandable is false, the other attributes don't matter; we just
259 * assume this color represents unknown character(s).
260 */
261typedef struct
262{
268
269/*
270 * A "prefix" is information about the colors of the last two characters read
271 * before reaching a specific NFA state. These colors can have special values
272 * COLOR_UNKNOWN and COLOR_BLANK. COLOR_UNKNOWN means that we have no
273 * information, for example because we read some character of an unexpandable
274 * color. COLOR_BLANK means that we read a non-word character.
275 *
276 * We call a prefix ambiguous if at least one of its colors is unknown. It's
277 * fully ambiguous if both are unknown, partially ambiguous if only the first
278 * is unknown. (The case of first color known, second unknown is not valid.)
279 *
280 * Wholly- or partly-blank prefixes are mostly handled the same as regular
281 * color prefixes. This allows us to generate appropriate partly-blank
282 * trigrams when the NFA requires word character(s) to appear adjacent to
283 * non-word character(s).
284 */
285typedef int TrgmColor;
286
287/* We assume that colors returned by the regexp engine cannot be these: */
288#define COLOR_UNKNOWN (-3)
289#define COLOR_BLANK (-4)
290
291typedef struct
292{
293 TrgmColor colors[2];
294} TrgmPrefix;
295
296/*
297 * Color-trigram data type. Note that some elements of the trigram can be
298 * COLOR_BLANK, but we don't allow COLOR_UNKNOWN.
299 */
300typedef struct
301{
302 TrgmColor colors[3];
303} ColorTrgm;
304
305/*
306 * Key identifying a state of our expanded graph: color prefix, and number
307 * of the corresponding state in the underlying regex NFA. The color prefix
308 * shows how we reached the regex state (to the extent that we know it).
309 */
310typedef struct
311{
315
316/*
317 * One state of the expanded graph.
318 *
319 * stateKey - ID of this state
320 * arcs - outgoing arcs of this state (List of TrgmArc)
321 * enterKeys - enter keys reachable from this state without reading any
322 * predictable trigram (List of TrgmStateKey)
323 * flags - flag bits
324 * snumber - number of this state (initially assigned as -1, -2, etc,
325 * for debugging purposes only; then at the packaging stage,
326 * surviving states are renumbered with positive numbers)
327 * parent - parent state, if this state has been merged into another
328 * tentFlags - flags this state would acquire via planned merges
329 * tentParent - planned parent state, if considering a merge
330 */
331#define TSTATE_INIT 0x01 /* flag indicating this state is initial */
332#define TSTATE_FIN 0x02 /* flag indicating this state is final */
333
334typedef struct TrgmState
335{
336 TrgmStateKey stateKey; /* hashtable key: must be first field */
339 int flags;
345
346/*
347 * One arc in the expanded graph.
348 */
349typedef struct
350{
351 ColorTrgm ctrgm; /* trigram needed to traverse arc */
352 TrgmState *target; /* next state */
353} TrgmArc;
354
355/*
356 * Information about arc of specific color trigram (used in stage 3)
357 *
358 * Contains pointers to the source and target states.
359 */
360typedef struct
361{
365
366/*
367 * Information about color trigram (used in stage 3)
368 *
369 * ctrgm - trigram itself
370 * cnumber - number of this trigram (used in the packaging stage)
371 * count - number of simple trigrams created from this color trigram
372 * expanded - indicates this color trigram is expanded into simple trigrams
373 * arcs - list of all arcs labeled with this color trigram.
374 */
375typedef struct
376{
379 int count;
384
385/*
386 * Data structure representing all the data we need during regex processing.
387 *
388 * regex - compiled regex
389 * colorInfo - extracted information about regex's colors
390 * ncolors - number of colors in colorInfo[]
391 * states - hashtable of TrgmStates (states of expanded graph)
392 * initState - pointer to initial state of expanded graph
393 * queue - queue of to-be-processed TrgmStates
394 * keysQueue - queue of to-be-processed TrgmStateKeys
395 * arcsCount - total number of arcs of expanded graph (for resource
396 * limiting)
397 * overflowed - we have exceeded resource limit for transformation
398 * colorTrgms - array of all color trigrams present in graph
399 * colorTrgmsCount - count of those color trigrams
400 * totalTrgmCount - total count of extracted simple trigrams
401 */
402typedef struct
403{
404 /* Source regexp, and color information extracted from it (stage 1) */
408
409 /* Expanded graph (stage 2) */
413
414 /* Workspace for stage 2 */
419
420 /* Information about distinct color trigrams in the graph (stage 3) */
424} TrgmNFA;
425
426/*
427 * Final, compact representation of expanded graph.
428 */
429typedef struct
430{
431 int targetState; /* index of target state (zero-based) */
432 int colorTrgm; /* index of color trigram for transition */
434
435typedef struct
436{
437 int arcsCount; /* number of out-arcs for this state */
438 TrgmPackedArc *arcs; /* array of arcsCount packed arcs */
440
441/* "typedef struct TrgmPackedGraph TrgmPackedGraph" appears in trgm.h */
443{
444 /*
445 * colorTrigramsCount and colorTrigramGroups contain information about how
446 * trigrams are grouped into color trigrams. "colorTrigramsCount" is the
447 * count of color trigrams and "colorTrigramGroups" contains number of
448 * simple trigrams for each color trigram. The array of simple trigrams
449 * (stored separately from this struct) is ordered so that the simple
450 * trigrams for each color trigram are consecutive, and they're in order
451 * by color trigram number.
452 */
454 int *colorTrigramGroups; /* array of size colorTrigramsCount */
455
456 /*
457 * The states of the simplified NFA. State number 0 is always initial
458 * state and state number 1 is always final state.
459 */
461 TrgmPackedState *states; /* array of size statesCount */
462
463 /* Temporary work space for trigramsMatchGraph() */
464 bool *colorTrigramsActive; /* array of size colorTrigramsCount */
465 bool *statesActive; /* array of size statesCount */
466 int *statesQueue; /* array of size statesCount */
467};
468
469/*
470 * Temporary structure for representing an arc during packaging.
471 */
472typedef struct
473{
478
479
480/* prototypes for private functions */
481static TRGM *createTrgmNFAInternal(regex_t *regex, TrgmPackedGraph **graph,
482 MemoryContext rcontext);
483static void RE_compile(regex_t *regex, text *text_re,
484 int cflags, Oid collation);
485static void getColorInfo(regex_t *regex, TrgmNFA *trgmNFA);
486static bool convertPgWchar(pg_wchar c, trgm_mb_char *result);
487static void transformGraph(TrgmNFA *trgmNFA);
488static void processState(TrgmNFA *trgmNFA, TrgmState *state);
489static void addKey(TrgmNFA *trgmNFA, TrgmState *state, TrgmStateKey *key);
490static void addKeyToQueue(TrgmNFA *trgmNFA, TrgmStateKey *key);
491static void addArcs(TrgmNFA *trgmNFA, TrgmState *state);
492static void addArc(TrgmNFA *trgmNFA, TrgmState *state, TrgmStateKey *key,
493 TrgmColor co, TrgmStateKey *destKey);
494static bool validArcLabel(TrgmStateKey *key, TrgmColor co);
495static TrgmState *getState(TrgmNFA *trgmNFA, TrgmStateKey *key);
496static bool prefixContains(TrgmPrefix *prefix1, TrgmPrefix *prefix2);
497static bool selectColorTrigrams(TrgmNFA *trgmNFA);
498static TRGM *expandColorTrigrams(TrgmNFA *trgmNFA, MemoryContext rcontext);
499static void fillTrgm(trgm *ptrgm, trgm_mb_char s[3]);
500static void mergeStates(TrgmState *state1, TrgmState *state2);
501static int colorTrgmInfoCmp(const void *p1, const void *p2);
502static int colorTrgmInfoPenaltyCmp(const void *p1, const void *p2);
503static TrgmPackedGraph *packGraph(TrgmNFA *trgmNFA, MemoryContext rcontext);
504static int packArcInfoCmp(const void *a1, const void *a2);
505
506#ifdef TRGM_REGEXP_DEBUG
507static void printSourceNFA(regex_t *regex, TrgmColorInfo *colors, int ncolors);
508static void printTrgmNFA(TrgmNFA *trgmNFA);
509static void printTrgmColor(StringInfo buf, TrgmColor co);
510static void printTrgmPackedGraph(TrgmPackedGraph *packedGraph, TRGM *trigrams);
511#endif
512
513
514/*
515 * Main entry point to process a regular expression.
516 *
517 * Returns an array of trigrams required by the regular expression, or NULL if
518 * the regular expression was too complex to analyze. In addition, a packed
519 * graph representation of the regex is returned into *graph. The results
520 * must be allocated in rcontext (which might or might not be the current
521 * context).
522 */
523TRGM *
524createTrgmNFA(text *text_re, Oid collation,
525 TrgmPackedGraph **graph, MemoryContext rcontext)
526{
527 TRGM *trg;
528 regex_t regex;
529 MemoryContext tmpcontext;
530 MemoryContext oldcontext;
531
532 /*
533 * This processing generates a great deal of cruft, which we'd like to
534 * clean up before returning (since this function may be called in a
535 * query-lifespan memory context). Make a temp context we can work in so
536 * that cleanup is easy.
537 */
539 "createTrgmNFA temporary context",
541 oldcontext = MemoryContextSwitchTo(tmpcontext);
542
543 /*
544 * Stage 1: Compile the regexp into a NFA, using the regexp library.
545 */
546#ifdef IGNORECASE
547 RE_compile(&regex, text_re,
548 REG_ADVANCED | REG_NOSUB | REG_ICASE, collation);
549#else
550 RE_compile(&regex, text_re,
551 REG_ADVANCED | REG_NOSUB, collation);
552#endif
553
554 trg = createTrgmNFAInternal(&regex, graph, rcontext);
555
556 /* Clean up all the cruft we created (including regex) */
557 MemoryContextSwitchTo(oldcontext);
558 MemoryContextDelete(tmpcontext);
559
560 return trg;
561}
562
563/*
564 * Body of createTrgmNFA, exclusive of regex compilation/freeing.
565 */
566static TRGM *
568 MemoryContext rcontext)
569{
570 TRGM *trg;
571 TrgmNFA trgmNFA;
572
573 trgmNFA.regex = regex;
574
575 /* Collect color information from the regex */
576 getColorInfo(regex, &trgmNFA);
577
578#ifdef TRGM_REGEXP_DEBUG
579 printSourceNFA(regex, trgmNFA.colorInfo, trgmNFA.ncolors);
580#endif
581
582 /*
583 * Stage 2: Create an expanded graph from the source NFA.
584 */
585 transformGraph(&trgmNFA);
586
587#ifdef TRGM_REGEXP_DEBUG
588 printTrgmNFA(&trgmNFA);
589#endif
590
591 /*
592 * Fail if we were unable to make a nontrivial graph, ie it is possible to
593 * get from the initial state to the final state without reading any
594 * predictable trigram.
595 */
596 if (trgmNFA.initState->flags & TSTATE_FIN)
597 return NULL;
598
599 /*
600 * Stage 3: Select color trigrams to expand. Fail if too many trigrams.
601 */
602 if (!selectColorTrigrams(&trgmNFA))
603 return NULL;
604
605 /*
606 * Stage 4: Expand color trigrams and pack graph into final
607 * representation.
608 */
609 trg = expandColorTrigrams(&trgmNFA, rcontext);
610
611 *graph = packGraph(&trgmNFA, rcontext);
612
613#ifdef TRGM_REGEXP_DEBUG
614 printTrgmPackedGraph(*graph, trg);
615#endif
616
617 return trg;
618}
619
620/*
621 * Main entry point for evaluating a graph during index scanning.
622 *
623 * The check[] array is indexed by trigram number (in the array of simple
624 * trigrams returned by createTrgmNFA), and holds true for those trigrams
625 * that are present in the index entry being checked.
626 */
627bool
629{
630 int i,
631 j,
632 k,
633 queueIn,
634 queueOut;
635
636 /*
637 * Reset temporary working areas.
638 */
639 memset(graph->colorTrigramsActive, 0,
640 sizeof(bool) * graph->colorTrigramsCount);
641 memset(graph->statesActive, 0, sizeof(bool) * graph->statesCount);
642
643 /*
644 * Check which color trigrams were matched. A match for any simple
645 * trigram associated with a color trigram counts as a match of the color
646 * trigram.
647 */
648 j = 0;
649 for (i = 0; i < graph->colorTrigramsCount; i++)
650 {
651 int cnt = graph->colorTrigramGroups[i];
652
653 for (k = j; k < j + cnt; k++)
654 {
655 if (check[k])
656 {
657 /*
658 * Found one matched trigram in the group. Can skip the rest
659 * of them and go to the next group.
660 */
661 graph->colorTrigramsActive[i] = true;
662 break;
663 }
664 }
665 j = j + cnt;
666 }
667
668 /*
669 * Initialize the statesQueue to hold just the initial state. Note:
670 * statesQueue has room for statesCount entries, which is certainly enough
671 * since no state will be put in the queue more than once. The
672 * statesActive array marks which states have been queued.
673 */
674 graph->statesActive[0] = true;
675 graph->statesQueue[0] = 0;
676 queueIn = 0;
677 queueOut = 1;
678
679 /* Process queued states as long as there are any. */
680 while (queueIn < queueOut)
681 {
682 int stateno = graph->statesQueue[queueIn++];
683 TrgmPackedState *state = &graph->states[stateno];
684 int cnt = state->arcsCount;
685
686 /* Loop over state's out-arcs */
687 for (i = 0; i < cnt; i++)
688 {
689 TrgmPackedArc *arc = &state->arcs[i];
690
691 /*
692 * If corresponding color trigram is present then activate the
693 * corresponding state. We're done if that's the final state,
694 * otherwise queue the state if it's not been queued already.
695 */
696 if (graph->colorTrigramsActive[arc->colorTrgm])
697 {
698 int nextstate = arc->targetState;
699
700 if (nextstate == 1)
701 return true; /* success: final state is reachable */
702
703 if (!graph->statesActive[nextstate])
704 {
705 graph->statesActive[nextstate] = true;
706 graph->statesQueue[queueOut++] = nextstate;
707 }
708 }
709 }
710 }
711
712 /* Queue is empty, so match fails. */
713 return false;
714}
715
716/*
717 * Compile regex string into struct at *regex.
718 * NB: pg_regfree must be applied to regex if this completes successfully.
719 */
720static void
721RE_compile(regex_t *regex, text *text_re, int cflags, Oid collation)
722{
723 int text_re_len = VARSIZE_ANY_EXHDR(text_re);
724 char *text_re_val = VARDATA_ANY(text_re);
725 pg_wchar *pattern;
726 int pattern_len;
727 int regcomp_result;
728 char errMsg[100];
729
730 /* Convert pattern string to wide characters */
731 pattern = (pg_wchar *) palloc((text_re_len + 1) * sizeof(pg_wchar));
732 pattern_len = pg_mb2wchar_with_len(text_re_val,
733 pattern,
734 text_re_len);
735
736 /* Compile regex */
737 regcomp_result = pg_regcomp(regex,
738 pattern,
739 pattern_len,
740 cflags,
741 collation);
742
743 pfree(pattern);
744
745 if (regcomp_result != REG_OKAY)
746 {
747 /* re didn't compile (no need for pg_regfree, if so) */
748 pg_regerror(regcomp_result, regex, errMsg, sizeof(errMsg));
750 (errcode(ERRCODE_INVALID_REGULAR_EXPRESSION),
751 errmsg("invalid regular expression: %s", errMsg)));
752 }
753}
754
755
756/*---------------------
757 * Subroutines for pre-processing the color map (stage 1).
758 *---------------------
759 */
760
761/*
762 * Fill TrgmColorInfo structure for each color using regex export functions.
763 */
764static void
765getColorInfo(regex_t *regex, TrgmNFA *trgmNFA)
766{
767 int colorsCount = pg_reg_getnumcolors(regex);
768 int i;
769
770 trgmNFA->ncolors = colorsCount;
771 trgmNFA->colorInfo = (TrgmColorInfo *)
772 palloc0(colorsCount * sizeof(TrgmColorInfo));
773
774 /*
775 * Loop over colors, filling TrgmColorInfo about each. Note we include
776 * WHITE (0) even though we know it'll be reported as non-expandable.
777 */
778 for (i = 0; i < colorsCount; i++)
779 {
780 TrgmColorInfo *colorInfo = &trgmNFA->colorInfo[i];
781 int charsCount = pg_reg_getnumcharacters(regex, i);
783 int j;
784
785 if (charsCount < 0 || charsCount > COLOR_COUNT_LIMIT)
786 {
787 /* Non expandable, or too large to work with */
788 colorInfo->expandable = false;
789 continue;
790 }
791
792 colorInfo->expandable = true;
793 colorInfo->containsNonWord = false;
794 colorInfo->wordChars = (trgm_mb_char *)
795 palloc(sizeof(trgm_mb_char) * charsCount);
796 colorInfo->wordCharsCount = 0;
797
798 /* Extract all the chars in this color */
799 chars = (pg_wchar *) palloc(sizeof(pg_wchar) * charsCount);
800 pg_reg_getcharacters(regex, i, chars, charsCount);
801
802 /*
803 * Convert characters back to multibyte form, and save only those that
804 * are word characters. Set "containsNonWord" if any non-word
805 * character. (Note: it'd probably be nicer to keep the chars in
806 * pg_wchar format for now, but ISWORDCHR wants to see multibyte.)
807 */
808 for (j = 0; j < charsCount; j++)
809 {
811
812 if (!convertPgWchar(chars[j], &c))
813 continue; /* ok to ignore it altogether */
814 if (ISWORDCHR(c.bytes))
815 colorInfo->wordChars[colorInfo->wordCharsCount++] = c;
816 else
817 colorInfo->containsNonWord = true;
818 }
819
820 pfree(chars);
821 }
822}
823
824/*
825 * Convert pg_wchar to multibyte format.
826 * Returns false if the character should be ignored completely.
827 */
828static bool
830{
831 /* "s" has enough space for a multibyte character and a trailing NUL */
832 char s[MAX_MULTIBYTE_CHAR_LEN + 1];
833
834 /*
835 * We can ignore the NUL character, since it can never appear in a PG text
836 * string. This avoids the need for various special cases when
837 * reconstructing trigrams.
838 */
839 if (c == 0)
840 return false;
841
842 /* Do the conversion, making sure the result is NUL-terminated */
843 memset(s, 0, sizeof(s));
844 pg_wchar2mb_with_len(&c, s, 1);
845
846 /*
847 * In IGNORECASE mode, we can ignore uppercase characters. We assume that
848 * the regex engine generated both uppercase and lowercase equivalents
849 * within each color, since we used the REG_ICASE option; so there's no
850 * need to process the uppercase version.
851 *
852 * XXX this code is dependent on the assumption that str_tolower() works
853 * the same as the regex engine's internal case folding machinery. Might
854 * be wiser to expose pg_wc_tolower and test whether c ==
855 * pg_wc_tolower(c). On the other hand, the trigrams in the index were
856 * created using str_tolower(), so we're probably screwed if there's any
857 * incompatibility anyway.
858 */
859#ifdef IGNORECASE
860 {
861 char *lowerCased = str_tolower(s, strlen(s), DEFAULT_COLLATION_OID);
862
863 if (strcmp(lowerCased, s) != 0)
864 {
865 pfree(lowerCased);
866 return false;
867 }
868 pfree(lowerCased);
869 }
870#endif
871
872 /* Fill result with exactly MAX_MULTIBYTE_CHAR_LEN bytes */
873 memcpy(result->bytes, s, MAX_MULTIBYTE_CHAR_LEN);
874 return true;
875}
876
877
878/*---------------------
879 * Subroutines for expanding original NFA graph into a trigram graph (stage 2).
880 *---------------------
881 */
882
883/*
884 * Transform the graph, given a regex and extracted color information.
885 *
886 * We create and process a queue of expanded-graph states until all the states
887 * are processed.
888 *
889 * This algorithm may be stopped due to resource limitation. In this case we
890 * force every unprocessed branch to immediately finish with matching (this
891 * can give us false positives but no false negatives) by marking all
892 * unprocessed states as final.
893 */
894static void
896{
897 HASHCTL hashCtl;
898 TrgmStateKey initkey;
899 TrgmState *initstate;
900 ListCell *lc;
901
902 /* Initialize this stage's workspace in trgmNFA struct */
903 trgmNFA->queue = NIL;
904 trgmNFA->keysQueue = NIL;
905 trgmNFA->arcsCount = 0;
906 trgmNFA->overflowed = false;
907
908 /* Create hashtable for states */
909 hashCtl.keysize = sizeof(TrgmStateKey);
910 hashCtl.entrysize = sizeof(TrgmState);
911 hashCtl.hcxt = CurrentMemoryContext;
912 trgmNFA->states = hash_create("Trigram NFA",
913 1024,
914 &hashCtl,
916 trgmNFA->nstates = 0;
917
918 /* Create initial state: ambiguous prefix, NFA's initial state */
919 MemSet(&initkey, 0, sizeof(initkey));
920 initkey.prefix.colors[0] = COLOR_UNKNOWN;
921 initkey.prefix.colors[1] = COLOR_UNKNOWN;
922 initkey.nstate = pg_reg_getinitialstate(trgmNFA->regex);
923
924 initstate = getState(trgmNFA, &initkey);
925 initstate->flags |= TSTATE_INIT;
926 trgmNFA->initState = initstate;
927
928 /*
929 * Recursively build the expanded graph by processing queue of states
930 * (breadth-first search). getState already put initstate in the queue.
931 * Note that getState will append new states to the queue within the loop,
932 * too; this works as long as we don't do repeat fetches using the "lc"
933 * pointer.
934 */
935 foreach(lc, trgmNFA->queue)
936 {
937 TrgmState *state = (TrgmState *) lfirst(lc);
938
939 /*
940 * If we overflowed then just mark state as final. Otherwise do
941 * actual processing.
942 */
943 if (trgmNFA->overflowed)
944 state->flags |= TSTATE_FIN;
945 else
946 processState(trgmNFA, state);
947
948 /* Did we overflow? */
949 if (trgmNFA->arcsCount > MAX_EXPANDED_ARCS ||
951 trgmNFA->overflowed = true;
952 }
953}
954
955/*
956 * Process one state: add enter keys and then add outgoing arcs.
957 */
958static void
960{
961 ListCell *lc;
962
963 /* keysQueue should be NIL already, but make sure */
964 trgmNFA->keysQueue = NIL;
965
966 /*
967 * Add state's own key, and then process all keys added to keysQueue until
968 * queue is finished. But we can quit if the state gets marked final.
969 */
970 addKey(trgmNFA, state, &state->stateKey);
971 foreach(lc, trgmNFA->keysQueue)
972 {
974
975 if (state->flags & TSTATE_FIN)
976 break;
977 addKey(trgmNFA, state, key);
978 }
979
980 /* Release keysQueue to clean up for next cycle */
981 list_free(trgmNFA->keysQueue);
982 trgmNFA->keysQueue = NIL;
983
984 /*
985 * Add outgoing arcs only if state isn't final (we have no interest in
986 * outgoing arcs if we already match)
987 */
988 if (!(state->flags & TSTATE_FIN))
989 addArcs(trgmNFA, state);
990}
991
992/*
993 * Add the given enter key into the state's enterKeys list, and determine
994 * whether this should result in any further enter keys being added.
995 * If so, add those keys to keysQueue so that processState will handle them.
996 *
997 * If the enter key is for the NFA's final state, mark state as TSTATE_FIN.
998 * This situation means that we can reach the final state from this expanded
999 * state without reading any predictable trigram, so we must consider this
1000 * state as an accepting one.
1001 *
1002 * The given key could be a duplicate of one already in enterKeys, or be
1003 * redundant with some enterKeys. So we check that before doing anything.
1004 *
1005 * Note that we don't generate any actual arcs here. addArcs will do that
1006 * later, after we have identified all the enter keys for this state.
1007 */
1008static void
1010{
1011 regex_arc_t *arcs;
1012 TrgmStateKey destKey;
1013 ListCell *cell;
1014 int i,
1015 arcsCount;
1016
1017 /*
1018 * Ensure any pad bytes in destKey are zero, since it may get used as a
1019 * hashtable key by getState.
1020 */
1021 MemSet(&destKey, 0, sizeof(destKey));
1022
1023 /*
1024 * Compare key to each existing enter key of the state to check for
1025 * redundancy. We can drop either old key(s) or the new key if we find
1026 * redundancy.
1027 */
1028 foreach(cell, state->enterKeys)
1029 {
1030 TrgmStateKey *existingKey = (TrgmStateKey *) lfirst(cell);
1031
1032 if (existingKey->nstate == key->nstate)
1033 {
1034 if (prefixContains(&existingKey->prefix, &key->prefix))
1035 {
1036 /* This old key already covers the new key. Nothing to do */
1037 return;
1038 }
1039 if (prefixContains(&key->prefix, &existingKey->prefix))
1040 {
1041 /*
1042 * The new key covers this old key. Remove the old key, it's
1043 * no longer needed once we add this key to the list.
1044 */
1045 state->enterKeys = foreach_delete_current(state->enterKeys,
1046 cell);
1047 }
1048 }
1049 }
1050
1051 /* No redundancy, so add this key to the state's list */
1052 state->enterKeys = lappend(state->enterKeys, key);
1053
1054 /* If state is now known final, mark it and we're done */
1055 if (key->nstate == pg_reg_getfinalstate(trgmNFA->regex))
1056 {
1057 state->flags |= TSTATE_FIN;
1058 return;
1059 }
1060
1061 /*
1062 * Loop through all outgoing arcs of the corresponding state in the
1063 * original NFA.
1064 */
1065 arcsCount = pg_reg_getnumoutarcs(trgmNFA->regex, key->nstate);
1066 arcs = (regex_arc_t *) palloc(sizeof(regex_arc_t) * arcsCount);
1067 pg_reg_getoutarcs(trgmNFA->regex, key->nstate, arcs, arcsCount);
1068
1069 for (i = 0; i < arcsCount; i++)
1070 {
1071 regex_arc_t *arc = &arcs[i];
1072
1073 if (pg_reg_colorisbegin(trgmNFA->regex, arc->co))
1074 {
1075 /*
1076 * Start of line/string (^). Trigram extraction treats start of
1077 * line same as start of word: double space prefix is added.
1078 * Hence, make an enter key showing we can reach the arc
1079 * destination with all-blank prefix.
1080 */
1081 destKey.prefix.colors[0] = COLOR_BLANK;
1082 destKey.prefix.colors[1] = COLOR_BLANK;
1083 destKey.nstate = arc->to;
1084
1085 /* Add enter key to this state */
1086 addKeyToQueue(trgmNFA, &destKey);
1087 }
1088 else if (pg_reg_colorisend(trgmNFA->regex, arc->co))
1089 {
1090 /*
1091 * End of line/string ($). We must consider this arc as a
1092 * transition that doesn't read anything. The reason for adding
1093 * this enter key to the state is that if the arc leads to the
1094 * NFA's final state, we must mark this expanded state as final.
1095 */
1096 destKey.prefix.colors[0] = COLOR_UNKNOWN;
1097 destKey.prefix.colors[1] = COLOR_UNKNOWN;
1098 destKey.nstate = arc->to;
1099
1100 /* Add enter key to this state */
1101 addKeyToQueue(trgmNFA, &destKey);
1102 }
1103 else if (arc->co >= 0)
1104 {
1105 /* Regular color (including WHITE) */
1106 TrgmColorInfo *colorInfo = &trgmNFA->colorInfo[arc->co];
1107
1108 if (colorInfo->expandable)
1109 {
1110 if (colorInfo->containsNonWord &&
1112 {
1113 /*
1114 * We can reach the arc destination after reading a
1115 * non-word character, but the prefix is not something
1116 * that addArc will accept with COLOR_BLANK, so no trigram
1117 * arc can get made for this transition. We must make an
1118 * enter key to show that the arc destination is
1119 * reachable. Set it up with an all-blank prefix, since
1120 * that corresponds to what the trigram extraction code
1121 * will do at a word starting boundary.
1122 */
1123 destKey.prefix.colors[0] = COLOR_BLANK;
1124 destKey.prefix.colors[1] = COLOR_BLANK;
1125 destKey.nstate = arc->to;
1126 addKeyToQueue(trgmNFA, &destKey);
1127 }
1128
1129 if (colorInfo->wordCharsCount > 0 &&
1130 !validArcLabel(key, arc->co))
1131 {
1132 /*
1133 * We can reach the arc destination after reading a word
1134 * character, but the prefix is not something that addArc
1135 * will accept, so no trigram arc can get made for this
1136 * transition. We must make an enter key to show that the
1137 * arc destination is reachable. The prefix for the enter
1138 * key should reflect the info we have for this arc.
1139 */
1140 destKey.prefix.colors[0] = key->prefix.colors[1];
1141 destKey.prefix.colors[1] = arc->co;
1142 destKey.nstate = arc->to;
1143 addKeyToQueue(trgmNFA, &destKey);
1144 }
1145 }
1146 else
1147 {
1148 /*
1149 * Unexpandable color. Add enter key with ambiguous prefix,
1150 * showing we can reach the destination from this state, but
1151 * the preceding colors will be uncertain. (We do not set the
1152 * first prefix color to key->prefix.colors[1], because a
1153 * prefix of known followed by unknown is invalid.)
1154 */
1155 destKey.prefix.colors[0] = COLOR_UNKNOWN;
1156 destKey.prefix.colors[1] = COLOR_UNKNOWN;
1157 destKey.nstate = arc->to;
1158 addKeyToQueue(trgmNFA, &destKey);
1159 }
1160 }
1161 else
1162 {
1163 /* RAINBOW: treat as unexpandable color */
1164 destKey.prefix.colors[0] = COLOR_UNKNOWN;
1165 destKey.prefix.colors[1] = COLOR_UNKNOWN;
1166 destKey.nstate = arc->to;
1167 addKeyToQueue(trgmNFA, &destKey);
1168 }
1169 }
1170
1171 pfree(arcs);
1172}
1173
1174/*
1175 * Add copy of given key to keysQueue for later processing.
1176 */
1177static void
1179{
1180 TrgmStateKey *keyCopy = (TrgmStateKey *) palloc(sizeof(TrgmStateKey));
1181
1182 memcpy(keyCopy, key, sizeof(TrgmStateKey));
1183 trgmNFA->keysQueue = lappend(trgmNFA->keysQueue, keyCopy);
1184}
1185
1186/*
1187 * Add outgoing arcs from given state, whose enter keys are all now known.
1188 */
1189static void
1191{
1192 TrgmStateKey destKey;
1193 ListCell *cell;
1194 regex_arc_t *arcs;
1195 int arcsCount,
1196 i;
1197
1198 /*
1199 * Ensure any pad bytes in destKey are zero, since it may get used as a
1200 * hashtable key by getState.
1201 */
1202 MemSet(&destKey, 0, sizeof(destKey));
1203
1204 /*
1205 * Iterate over enter keys associated with this expanded-graph state. This
1206 * includes both the state's own stateKey, and any enter keys we added to
1207 * it during addKey (which represent expanded-graph states that are not
1208 * distinguishable from this one by means of trigrams). For each such
1209 * enter key, examine all the out-arcs of the key's underlying NFA state,
1210 * and try to make a trigram arc leading to where the out-arc leads.
1211 * (addArc will deal with whether the arc is valid or not.)
1212 */
1213 foreach(cell, state->enterKeys)
1214 {
1215 TrgmStateKey *key = (TrgmStateKey *) lfirst(cell);
1216
1217 arcsCount = pg_reg_getnumoutarcs(trgmNFA->regex, key->nstate);
1218 arcs = (regex_arc_t *) palloc(sizeof(regex_arc_t) * arcsCount);
1219 pg_reg_getoutarcs(trgmNFA->regex, key->nstate, arcs, arcsCount);
1220
1221 for (i = 0; i < arcsCount; i++)
1222 {
1223 regex_arc_t *arc = &arcs[i];
1224 TrgmColorInfo *colorInfo;
1225
1226 /*
1227 * Ignore non-expandable colors; addKey already handled the case.
1228 *
1229 * We need no special check for WHITE or begin/end pseudocolors
1230 * here. We don't need to do any processing for them, and they
1231 * will be marked non-expandable since the regex engine will have
1232 * reported them that way. We do have to watch out for RAINBOW,
1233 * which has a negative color number.
1234 */
1235 if (arc->co < 0)
1236 continue;
1237 Assert(arc->co < trgmNFA->ncolors);
1238
1239 colorInfo = &trgmNFA->colorInfo[arc->co];
1240 if (!colorInfo->expandable)
1241 continue;
1242
1243 if (colorInfo->containsNonWord)
1244 {
1245 /*
1246 * Color includes non-word character(s).
1247 *
1248 * Generate an arc, treating this transition as occurring on
1249 * BLANK. This allows word-ending trigrams to be manufactured
1250 * if possible.
1251 */
1252 destKey.prefix.colors[0] = key->prefix.colors[1];
1253 destKey.prefix.colors[1] = COLOR_BLANK;
1254 destKey.nstate = arc->to;
1255
1256 addArc(trgmNFA, state, key, COLOR_BLANK, &destKey);
1257 }
1258
1259 if (colorInfo->wordCharsCount > 0)
1260 {
1261 /*
1262 * Color includes word character(s).
1263 *
1264 * Generate an arc. Color is pushed into prefix of target
1265 * state.
1266 */
1267 destKey.prefix.colors[0] = key->prefix.colors[1];
1268 destKey.prefix.colors[1] = arc->co;
1269 destKey.nstate = arc->to;
1270
1271 addArc(trgmNFA, state, key, arc->co, &destKey);
1272 }
1273 }
1274
1275 pfree(arcs);
1276 }
1277}
1278
1279/*
1280 * Generate an out-arc of the expanded graph, if it's valid and not redundant.
1281 *
1282 * state: expanded-graph state we want to add an out-arc to
1283 * key: provides prefix colors (key->nstate is not used)
1284 * co: transition color
1285 * destKey: identifier for destination state of expanded graph
1286 */
1287static void
1289 TrgmColor co, TrgmStateKey *destKey)
1290{
1291 TrgmArc *arc;
1292 ListCell *cell;
1293
1294 /* Do nothing if this wouldn't be a valid arc label trigram */
1295 if (!validArcLabel(key, co))
1296 return;
1297
1298 /*
1299 * Check if we are going to reach key which is covered by a key which is
1300 * already listed in this state. If so arc is useless: the NFA can bypass
1301 * it through a path that doesn't require any predictable trigram, so
1302 * whether the arc's trigram is present or not doesn't really matter.
1303 */
1304 foreach(cell, state->enterKeys)
1305 {
1306 TrgmStateKey *existingKey = (TrgmStateKey *) lfirst(cell);
1307
1308 if (existingKey->nstate == destKey->nstate &&
1309 prefixContains(&existingKey->prefix, &destKey->prefix))
1310 return;
1311 }
1312
1313 /* Checks were successful, add new arc */
1314 arc = (TrgmArc *) palloc(sizeof(TrgmArc));
1315 arc->target = getState(trgmNFA, destKey);
1316 arc->ctrgm.colors[0] = key->prefix.colors[0];
1317 arc->ctrgm.colors[1] = key->prefix.colors[1];
1318 arc->ctrgm.colors[2] = co;
1319
1320 state->arcs = lappend(state->arcs, arc);
1321 trgmNFA->arcsCount++;
1322}
1323
1324/*
1325 * Can we make a valid trigram arc label from the given prefix and arc color?
1326 *
1327 * This is split out so that tests in addKey and addArc will stay in sync.
1328 */
1329static bool
1331{
1332 /*
1333 * We have to know full trigram in order to add outgoing arc. So we can't
1334 * do it if prefix is ambiguous.
1335 */
1336 if (key->prefix.colors[0] == COLOR_UNKNOWN)
1337 return false;
1338
1339 /* If key->prefix.colors[0] isn't unknown, its second color isn't either */
1340 Assert(key->prefix.colors[1] != COLOR_UNKNOWN);
1341 /* And we should not be called with an unknown arc color anytime */
1342 Assert(co != COLOR_UNKNOWN);
1343
1344 /*
1345 * We don't bother with making arcs representing three non-word
1346 * characters, since that's useless for trigram extraction.
1347 */
1348 if (key->prefix.colors[0] == COLOR_BLANK &&
1349 key->prefix.colors[1] == COLOR_BLANK &&
1350 co == COLOR_BLANK)
1351 return false;
1352
1353 /*
1354 * We also reject nonblank-blank-anything. The nonblank-blank-nonblank
1355 * case doesn't correspond to any trigram the trigram extraction code
1356 * would make. The nonblank-blank-blank case is also not possible with
1357 * RPADDING = 1. (Note that in many cases we'd fail to generate such a
1358 * trigram even if it were valid, for example processing "foo bar" will
1359 * not result in considering the trigram "o ". So if you want to support
1360 * RPADDING = 2, there's more to do than just twiddle this test.)
1361 */
1362 if (key->prefix.colors[0] != COLOR_BLANK &&
1363 key->prefix.colors[1] == COLOR_BLANK)
1364 return false;
1365
1366 /*
1367 * Other combinations involving blank are valid, in particular we assume
1368 * blank-blank-nonblank is valid, which presumes that LPADDING is 2.
1369 *
1370 * Note: Using again the example "foo bar", we will not consider the
1371 * trigram " b", though this trigram would be found by the trigram
1372 * extraction code. Since we will find " ba", it doesn't seem worth
1373 * trying to hack the algorithm to generate the additional trigram.
1374 */
1375
1376 /* arc label is valid */
1377 return true;
1378}
1379
1380/*
1381 * Get state of expanded graph for given state key,
1382 * and queue the state for processing if it didn't already exist.
1383 */
1384static TrgmState *
1386{
1388 bool found;
1389
1391 &found);
1392 if (!found)
1393 {
1394 /* New state: initialize and queue it */
1395 state->arcs = NIL;
1396 state->enterKeys = NIL;
1397 state->flags = 0;
1398 /* states are initially given negative numbers */
1399 state->snumber = -(++trgmNFA->nstates);
1400 state->parent = NULL;
1401 state->tentFlags = 0;
1402 state->tentParent = NULL;
1403
1404 trgmNFA->queue = lappend(trgmNFA->queue, state);
1405 }
1406 return state;
1407}
1408
1409/*
1410 * Check if prefix1 "contains" prefix2.
1411 *
1412 * "contains" means that any exact prefix (with no ambiguity) that satisfies
1413 * prefix2 also satisfies prefix1.
1414 */
1415static bool
1417{
1418 if (prefix1->colors[1] == COLOR_UNKNOWN)
1419 {
1420 /* Fully ambiguous prefix contains everything */
1421 return true;
1422 }
1423 else if (prefix1->colors[0] == COLOR_UNKNOWN)
1424 {
1425 /*
1426 * Prefix with only first unknown color contains every prefix with
1427 * same second color.
1428 */
1429 if (prefix1->colors[1] == prefix2->colors[1])
1430 return true;
1431 else
1432 return false;
1433 }
1434 else
1435 {
1436 /* Exact prefix contains only the exact same prefix */
1437 if (prefix1->colors[0] == prefix2->colors[0] &&
1438 prefix1->colors[1] == prefix2->colors[1])
1439 return true;
1440 else
1441 return false;
1442 }
1443}
1444
1445
1446/*---------------------
1447 * Subroutines for expanding color trigrams into regular trigrams (stage 3).
1448 *---------------------
1449 */
1450
1451/*
1452 * Get vector of all color trigrams in graph and select which of them
1453 * to expand into simple trigrams.
1454 *
1455 * Returns true if OK, false if exhausted resource limits.
1456 */
1457static bool
1459{
1460 HASH_SEQ_STATUS scan_status;
1461 int arcsCount = trgmNFA->arcsCount,
1462 i;
1464 ColorTrgmInfo *colorTrgms;
1465 int64 totalTrgmCount;
1466 float4 totalTrgmPenalty;
1467 int cnumber;
1468
1469 /* Collect color trigrams from all arcs */
1470 colorTrgms = (ColorTrgmInfo *) palloc0(sizeof(ColorTrgmInfo) * arcsCount);
1471 trgmNFA->colorTrgms = colorTrgms;
1472
1473 i = 0;
1474 hash_seq_init(&scan_status, trgmNFA->states);
1475 while ((state = (TrgmState *) hash_seq_search(&scan_status)) != NULL)
1476 {
1477 ListCell *cell;
1478
1479 foreach(cell, state->arcs)
1480 {
1481 TrgmArc *arc = (TrgmArc *) lfirst(cell);
1482 TrgmArcInfo *arcInfo = (TrgmArcInfo *) palloc(sizeof(TrgmArcInfo));
1483 ColorTrgmInfo *trgmInfo = &colorTrgms[i];
1484
1485 arcInfo->source = state;
1486 arcInfo->target = arc->target;
1487 trgmInfo->ctrgm = arc->ctrgm;
1488 trgmInfo->cnumber = -1;
1489 /* count and penalty will be set below */
1490 trgmInfo->expanded = true;
1491 trgmInfo->arcs = list_make1(arcInfo);
1492 i++;
1493 }
1494 }
1495 Assert(i == arcsCount);
1496
1497 /* Remove duplicates, merging their arcs lists */
1498 if (arcsCount >= 2)
1499 {
1500 ColorTrgmInfo *p1,
1501 *p2;
1502
1503 /* Sort trigrams to ease duplicate detection */
1504 qsort(colorTrgms, arcsCount, sizeof(ColorTrgmInfo), colorTrgmInfoCmp);
1505
1506 /* p1 is probe point, p2 is last known non-duplicate. */
1507 p2 = colorTrgms;
1508 for (p1 = colorTrgms + 1; p1 < colorTrgms + arcsCount; p1++)
1509 {
1510 if (colorTrgmInfoCmp(p1, p2) > 0)
1511 {
1512 p2++;
1513 *p2 = *p1;
1514 }
1515 else
1516 {
1517 p2->arcs = list_concat(p2->arcs, p1->arcs);
1518 }
1519 }
1520 trgmNFA->colorTrgmsCount = (p2 - colorTrgms) + 1;
1521 }
1522 else
1523 {
1524 trgmNFA->colorTrgmsCount = arcsCount;
1525 }
1526
1527 /*
1528 * Count number of simple trigrams generated by each color trigram, and
1529 * also compute a penalty value, which is the number of simple trigrams
1530 * times a multiplier that depends on its whitespace content.
1531 *
1532 * Note: per-color-trigram counts cannot overflow an int so long as
1533 * COLOR_COUNT_LIMIT is not more than the cube root of INT_MAX, ie about
1534 * 1290. However, the grand total totalTrgmCount might conceivably
1535 * overflow an int, so we use int64 for that within this routine. Also,
1536 * penalties are calculated in float4 arithmetic to avoid any overflow
1537 * worries.
1538 */
1539 totalTrgmCount = 0;
1540 totalTrgmPenalty = 0.0f;
1541 for (i = 0; i < trgmNFA->colorTrgmsCount; i++)
1542 {
1543 ColorTrgmInfo *trgmInfo = &colorTrgms[i];
1544 int j,
1545 count = 1,
1546 typeIndex = 0;
1547
1548 for (j = 0; j < 3; j++)
1549 {
1550 TrgmColor c = trgmInfo->ctrgm.colors[j];
1551
1552 typeIndex *= 2;
1553 if (c == COLOR_BLANK)
1554 typeIndex++;
1555 else
1556 count *= trgmNFA->colorInfo[c].wordCharsCount;
1557 }
1558 trgmInfo->count = count;
1559 totalTrgmCount += count;
1560 trgmInfo->penalty = penalties[typeIndex] * (float4) count;
1561 totalTrgmPenalty += trgmInfo->penalty;
1562 }
1563
1564 /* Sort color trigrams in descending order of their penalties */
1565 qsort(colorTrgms, trgmNFA->colorTrgmsCount, sizeof(ColorTrgmInfo),
1567
1568 /*
1569 * Remove color trigrams from the graph so long as total penalty of color
1570 * trigrams exceeds WISH_TRGM_PENALTY. (If we fail to get down to
1571 * WISH_TRGM_PENALTY, it's OK so long as total count is no more than
1572 * MAX_TRGM_COUNT.) We prefer to remove color trigrams with higher
1573 * penalty, since those are the most promising for reducing the total
1574 * penalty. When removing a color trigram we have to merge states
1575 * connected by arcs labeled with that trigram. It's necessary to not
1576 * merge initial and final states, because our graph becomes useless if
1577 * that happens; so we cannot always remove the trigram we'd prefer to.
1578 */
1579 for (i = 0; i < trgmNFA->colorTrgmsCount; i++)
1580 {
1581 ColorTrgmInfo *trgmInfo = &colorTrgms[i];
1582 bool canRemove = true;
1583 ListCell *cell;
1584
1585 /* Done if we've reached the target */
1586 if (totalTrgmPenalty <= WISH_TRGM_PENALTY)
1587 break;
1588
1589#ifdef TRGM_REGEXP_DEBUG
1590 fprintf(stderr, "considering ctrgm %d %d %d, penalty %f, %d arcs\n",
1591 trgmInfo->ctrgm.colors[0],
1592 trgmInfo->ctrgm.colors[1],
1593 trgmInfo->ctrgm.colors[2],
1594 trgmInfo->penalty,
1595 list_length(trgmInfo->arcs));
1596#endif
1597
1598 /*
1599 * Does any arc of this color trigram connect initial and final
1600 * states? If so we can't remove it.
1601 */
1602 foreach(cell, trgmInfo->arcs)
1603 {
1604 TrgmArcInfo *arcInfo = (TrgmArcInfo *) lfirst(cell);
1605 TrgmState *source = arcInfo->source,
1606 *target = arcInfo->target;
1607 int source_flags,
1608 target_flags;
1609
1610#ifdef TRGM_REGEXP_DEBUG
1611 fprintf(stderr, "examining arc to s%d (%x) from s%d (%x)\n",
1612 -target->snumber, target->flags,
1613 -source->snumber, source->flags);
1614#endif
1615
1616 /* examine parent states, if any merging has already happened */
1617 while (source->parent)
1618 source = source->parent;
1619 while (target->parent)
1620 target = target->parent;
1621
1622#ifdef TRGM_REGEXP_DEBUG
1623 fprintf(stderr, " ... after completed merges: to s%d (%x) from s%d (%x)\n",
1624 -target->snumber, target->flags,
1625 -source->snumber, source->flags);
1626#endif
1627
1628 /* we must also consider merges we are planning right now */
1629 source_flags = source->flags | source->tentFlags;
1630 while (source->tentParent)
1631 {
1632 source = source->tentParent;
1633 source_flags |= source->flags | source->tentFlags;
1634 }
1635 target_flags = target->flags | target->tentFlags;
1636 while (target->tentParent)
1637 {
1638 target = target->tentParent;
1639 target_flags |= target->flags | target->tentFlags;
1640 }
1641
1642#ifdef TRGM_REGEXP_DEBUG
1643 fprintf(stderr, " ... after tentative merges: to s%d (%x) from s%d (%x)\n",
1644 -target->snumber, target_flags,
1645 -source->snumber, source_flags);
1646#endif
1647
1648 /* would fully-merged state have both INIT and FIN set? */
1649 if (((source_flags | target_flags) & (TSTATE_INIT | TSTATE_FIN)) ==
1651 {
1652 canRemove = false;
1653 break;
1654 }
1655
1656 /* ok so far, so remember planned merge */
1657 if (source != target)
1658 {
1659#ifdef TRGM_REGEXP_DEBUG
1660 fprintf(stderr, " ... tentatively merging s%d into s%d\n",
1661 -target->snumber, -source->snumber);
1662#endif
1663 target->tentParent = source;
1664 source->tentFlags |= target_flags;
1665 }
1666 }
1667
1668 /*
1669 * We must reset all the tentFlags/tentParent fields before
1670 * continuing. tentFlags could only have become set in states that
1671 * are the source or parent or tentative parent of one of the current
1672 * arcs; likewise tentParent could only have become set in states that
1673 * are the target or parent or tentative parent of one of the current
1674 * arcs. There might be some overlap between those sets, but if we
1675 * clear tentFlags in target states as well as source states, we
1676 * should be okay even if we visit a state as target before visiting
1677 * it as a source.
1678 */
1679 foreach(cell, trgmInfo->arcs)
1680 {
1681 TrgmArcInfo *arcInfo = (TrgmArcInfo *) lfirst(cell);
1682 TrgmState *source = arcInfo->source,
1683 *target = arcInfo->target;
1684 TrgmState *ttarget;
1685
1686 /* no need to touch previously-merged states */
1687 while (source->parent)
1688 source = source->parent;
1689 while (target->parent)
1690 target = target->parent;
1691
1692 while (source)
1693 {
1694 source->tentFlags = 0;
1695 source = source->tentParent;
1696 }
1697
1698 while ((ttarget = target->tentParent) != NULL)
1699 {
1700 target->tentParent = NULL;
1701 target->tentFlags = 0; /* in case it was also a source */
1702 target = ttarget;
1703 }
1704 }
1705
1706 /* Now, move on if we can't drop this trigram */
1707 if (!canRemove)
1708 {
1709#ifdef TRGM_REGEXP_DEBUG
1710 fprintf(stderr, " ... not ok to merge\n");
1711#endif
1712 continue;
1713 }
1714
1715 /* OK, merge states linked by each arc labeled by the trigram */
1716 foreach(cell, trgmInfo->arcs)
1717 {
1718 TrgmArcInfo *arcInfo = (TrgmArcInfo *) lfirst(cell);
1719 TrgmState *source = arcInfo->source,
1720 *target = arcInfo->target;
1721
1722 while (source->parent)
1723 source = source->parent;
1724 while (target->parent)
1725 target = target->parent;
1726 if (source != target)
1727 {
1728#ifdef TRGM_REGEXP_DEBUG
1729 fprintf(stderr, "merging s%d into s%d\n",
1730 -target->snumber, -source->snumber);
1731#endif
1732 mergeStates(source, target);
1733 /* Assert we didn't merge initial and final states */
1734 Assert((source->flags & (TSTATE_INIT | TSTATE_FIN)) !=
1736 }
1737 }
1738
1739 /* Mark trigram unexpanded, and update totals */
1740 trgmInfo->expanded = false;
1741 totalTrgmCount -= trgmInfo->count;
1742 totalTrgmPenalty -= trgmInfo->penalty;
1743 }
1744
1745 /* Did we succeed in fitting into MAX_TRGM_COUNT? */
1746 if (totalTrgmCount > MAX_TRGM_COUNT)
1747 return false;
1748
1749 trgmNFA->totalTrgmCount = (int) totalTrgmCount;
1750
1751 /*
1752 * Sort color trigrams by colors (will be useful for bsearch in packGraph)
1753 * and enumerate the color trigrams that are expanded.
1754 */
1755 cnumber = 0;
1756 qsort(colorTrgms, trgmNFA->colorTrgmsCount, sizeof(ColorTrgmInfo),
1758 for (i = 0; i < trgmNFA->colorTrgmsCount; i++)
1759 {
1760 if (colorTrgms[i].expanded)
1761 {
1762 colorTrgms[i].cnumber = cnumber;
1763 cnumber++;
1764 }
1765 }
1766
1767 return true;
1768}
1769
1770/*
1771 * Expand selected color trigrams into regular trigrams.
1772 *
1773 * Returns the TRGM array to be passed to the index machinery.
1774 * The array must be allocated in rcontext.
1775 */
1776static TRGM *
1778{
1779 TRGM *trg;
1780 trgm *p;
1781 int i;
1782 TrgmColorInfo blankColor;
1783 trgm_mb_char blankChar;
1784
1785 /* Set up "blank" color structure containing a single zero character */
1786 memset(blankChar.bytes, 0, sizeof(blankChar.bytes));
1787 blankColor.wordCharsCount = 1;
1788 blankColor.wordChars = &blankChar;
1789
1790 /* Construct the trgm array */
1791 trg = (TRGM *)
1792 MemoryContextAllocZero(rcontext,
1793 TRGMHDRSIZE +
1794 trgmNFA->totalTrgmCount * sizeof(trgm));
1795 trg->flag = ARRKEY;
1797 p = GETARR(trg);
1798 for (i = 0; i < trgmNFA->colorTrgmsCount; i++)
1799 {
1800 ColorTrgmInfo *colorTrgm = &trgmNFA->colorTrgms[i];
1801 TrgmColorInfo *c[3];
1802 trgm_mb_char s[3];
1803 int j,
1804 i1,
1805 i2,
1806 i3;
1807
1808 /* Ignore any unexpanded trigrams ... */
1809 if (!colorTrgm->expanded)
1810 continue;
1811
1812 /* Get colors, substituting the dummy struct for COLOR_BLANK */
1813 for (j = 0; j < 3; j++)
1814 {
1815 if (colorTrgm->ctrgm.colors[j] != COLOR_BLANK)
1816 c[j] = &trgmNFA->colorInfo[colorTrgm->ctrgm.colors[j]];
1817 else
1818 c[j] = &blankColor;
1819 }
1820
1821 /* Iterate over all possible combinations of colors' characters */
1822 for (i1 = 0; i1 < c[0]->wordCharsCount; i1++)
1823 {
1824 s[0] = c[0]->wordChars[i1];
1825 for (i2 = 0; i2 < c[1]->wordCharsCount; i2++)
1826 {
1827 s[1] = c[1]->wordChars[i2];
1828 for (i3 = 0; i3 < c[2]->wordCharsCount; i3++)
1829 {
1830 s[2] = c[2]->wordChars[i3];
1831 fillTrgm(p, s);
1832 p++;
1833 }
1834 }
1835 }
1836 }
1837
1838 return trg;
1839}
1840
1841/*
1842 * Convert trigram into trgm datatype.
1843 */
1844static void
1846{
1847 char str[3 * MAX_MULTIBYTE_CHAR_LEN],
1848 *p;
1849 int i,
1850 j;
1851
1852 /* Write multibyte string into "str" (we don't need null termination) */
1853 p = str;
1854
1855 for (i = 0; i < 3; i++)
1856 {
1857 if (s[i].bytes[0] != 0)
1858 {
1859 for (j = 0; j < MAX_MULTIBYTE_CHAR_LEN && s[i].bytes[j]; j++)
1860 *p++ = s[i].bytes[j];
1861 }
1862 else
1863 {
1864 /* Emit a space in place of COLOR_BLANK */
1865 *p++ = ' ';
1866 }
1867 }
1868
1869 /* Convert "str" to a standard trigram (possibly hashing it) */
1870 compact_trigram(ptrgm, str, p - str);
1871}
1872
1873/*
1874 * Merge two states of graph.
1875 */
1876static void
1878{
1879 Assert(state1 != state2);
1880 Assert(!state1->parent);
1881 Assert(!state2->parent);
1882
1883 /* state1 absorbs state2's flags */
1884 state1->flags |= state2->flags;
1885
1886 /* state2, and indirectly all its children, become children of state1 */
1887 state2->parent = state1;
1888}
1889
1890/*
1891 * Compare function for sorting of color trigrams by their colors.
1892 */
1893static int
1894colorTrgmInfoCmp(const void *p1, const void *p2)
1895{
1896 const ColorTrgmInfo *c1 = (const ColorTrgmInfo *) p1;
1897 const ColorTrgmInfo *c2 = (const ColorTrgmInfo *) p2;
1898
1899 return memcmp(&c1->ctrgm, &c2->ctrgm, sizeof(ColorTrgm));
1900}
1901
1902/*
1903 * Compare function for sorting color trigrams in descending order of
1904 * their penalty fields.
1905 */
1906static int
1907colorTrgmInfoPenaltyCmp(const void *p1, const void *p2)
1908{
1909 float4 penalty1 = ((const ColorTrgmInfo *) p1)->penalty;
1910 float4 penalty2 = ((const ColorTrgmInfo *) p2)->penalty;
1911
1912 if (penalty1 < penalty2)
1913 return 1;
1914 else if (penalty1 == penalty2)
1915 return 0;
1916 else
1917 return -1;
1918}
1919
1920
1921/*---------------------
1922 * Subroutines for packing the graph into final representation (stage 4).
1923 *---------------------
1924 */
1925
1926/*
1927 * Pack expanded graph into final representation.
1928 *
1929 * The result data must be allocated in rcontext.
1930 */
1931static TrgmPackedGraph *
1933{
1934 int snumber = 2,
1935 arcIndex,
1936 arcsCount;
1937 HASH_SEQ_STATUS scan_status;
1939 TrgmPackArcInfo *arcs;
1940 TrgmPackedArc *packedArcs;
1941 TrgmPackedGraph *result;
1942 int i,
1943 j;
1944
1945 /* Enumerate surviving states, giving init and fin reserved numbers */
1946 hash_seq_init(&scan_status, trgmNFA->states);
1947 while ((state = (TrgmState *) hash_seq_search(&scan_status)) != NULL)
1948 {
1949 while (state->parent)
1950 state = state->parent;
1951
1952 if (state->snumber < 0)
1953 {
1954 if (state->flags & TSTATE_INIT)
1955 state->snumber = 0;
1956 else if (state->flags & TSTATE_FIN)
1957 state->snumber = 1;
1958 else
1959 {
1960 state->snumber = snumber;
1961 snumber++;
1962 }
1963 }
1964 }
1965
1966 /* Collect array of all arcs */
1967 arcs = (TrgmPackArcInfo *)
1968 palloc(sizeof(TrgmPackArcInfo) * trgmNFA->arcsCount);
1969 arcIndex = 0;
1970 hash_seq_init(&scan_status, trgmNFA->states);
1971 while ((state = (TrgmState *) hash_seq_search(&scan_status)) != NULL)
1972 {
1974 ListCell *cell;
1975
1976 while (source->parent)
1977 source = source->parent;
1978
1979 foreach(cell, state->arcs)
1980 {
1981 TrgmArc *arc = (TrgmArc *) lfirst(cell);
1982 TrgmState *target = arc->target;
1983
1984 while (target->parent)
1985 target = target->parent;
1986
1987 if (source->snumber != target->snumber)
1988 {
1989 ColorTrgmInfo *ctrgm;
1990
1991 ctrgm = (ColorTrgmInfo *) bsearch(&arc->ctrgm,
1992 trgmNFA->colorTrgms,
1993 trgmNFA->colorTrgmsCount,
1994 sizeof(ColorTrgmInfo),
1996 Assert(ctrgm != NULL);
1997 Assert(ctrgm->expanded);
1998
1999 arcs[arcIndex].sourceState = source->snumber;
2000 arcs[arcIndex].targetState = target->snumber;
2001 arcs[arcIndex].colorTrgm = ctrgm->cnumber;
2002 arcIndex++;
2003 }
2004 }
2005 }
2006
2007 /* Sort arcs to ease duplicate detection */
2008 qsort(arcs, arcIndex, sizeof(TrgmPackArcInfo), packArcInfoCmp);
2009
2010 /* We could have duplicates because states were merged. Remove them. */
2011 if (arcIndex > 1)
2012 {
2013 /* p1 is probe point, p2 is last known non-duplicate. */
2014 TrgmPackArcInfo *p1,
2015 *p2;
2016
2017 p2 = arcs;
2018 for (p1 = arcs + 1; p1 < arcs + arcIndex; p1++)
2019 {
2020 if (packArcInfoCmp(p1, p2) > 0)
2021 {
2022 p2++;
2023 *p2 = *p1;
2024 }
2025 }
2026 arcsCount = (p2 - arcs) + 1;
2027 }
2028 else
2029 arcsCount = arcIndex;
2030
2031 /* Create packed representation */
2032 result = (TrgmPackedGraph *)
2033 MemoryContextAlloc(rcontext, sizeof(TrgmPackedGraph));
2034
2035 /* Pack color trigrams information */
2036 result->colorTrigramsCount = 0;
2037 for (i = 0; i < trgmNFA->colorTrgmsCount; i++)
2038 {
2039 if (trgmNFA->colorTrgms[i].expanded)
2040 result->colorTrigramsCount++;
2041 }
2042 result->colorTrigramGroups = (int *)
2043 MemoryContextAlloc(rcontext, sizeof(int) * result->colorTrigramsCount);
2044 j = 0;
2045 for (i = 0; i < trgmNFA->colorTrgmsCount; i++)
2046 {
2047 if (trgmNFA->colorTrgms[i].expanded)
2048 {
2049 result->colorTrigramGroups[j] = trgmNFA->colorTrgms[i].count;
2050 j++;
2051 }
2052 }
2053
2054 /* Pack states and arcs information */
2055 result->statesCount = snumber;
2056 result->states = (TrgmPackedState *)
2057 MemoryContextAlloc(rcontext, snumber * sizeof(TrgmPackedState));
2058 packedArcs = (TrgmPackedArc *)
2059 MemoryContextAlloc(rcontext, arcsCount * sizeof(TrgmPackedArc));
2060 j = 0;
2061 for (i = 0; i < snumber; i++)
2062 {
2063 int cnt = 0;
2064
2065 result->states[i].arcs = &packedArcs[j];
2066 while (j < arcsCount && arcs[j].sourceState == i)
2067 {
2068 packedArcs[j].targetState = arcs[j].targetState;
2069 packedArcs[j].colorTrgm = arcs[j].colorTrgm;
2070 cnt++;
2071 j++;
2072 }
2073 result->states[i].arcsCount = cnt;
2074 }
2075
2076 /* Allocate working memory for trigramsMatchGraph() */
2077 result->colorTrigramsActive = (bool *)
2078 MemoryContextAlloc(rcontext, sizeof(bool) * result->colorTrigramsCount);
2079 result->statesActive = (bool *)
2080 MemoryContextAlloc(rcontext, sizeof(bool) * result->statesCount);
2081 result->statesQueue = (int *)
2082 MemoryContextAlloc(rcontext, sizeof(int) * result->statesCount);
2083
2084 return result;
2085}
2086
2087/*
2088 * Comparison function for sorting TrgmPackArcInfos.
2089 *
2090 * Compares arcs in following order: sourceState, colorTrgm, targetState.
2091 */
2092static int
2093packArcInfoCmp(const void *a1, const void *a2)
2094{
2095 const TrgmPackArcInfo *p1 = (const TrgmPackArcInfo *) a1;
2096 const TrgmPackArcInfo *p2 = (const TrgmPackArcInfo *) a2;
2097
2098 if (p1->sourceState < p2->sourceState)
2099 return -1;
2100 if (p1->sourceState > p2->sourceState)
2101 return 1;
2102 if (p1->colorTrgm < p2->colorTrgm)
2103 return -1;
2104 if (p1->colorTrgm > p2->colorTrgm)
2105 return 1;
2106 if (p1->targetState < p2->targetState)
2107 return -1;
2108 if (p1->targetState > p2->targetState)
2109 return 1;
2110 return 0;
2111}
2112
2113
2114/*---------------------
2115 * Debugging functions
2116 *
2117 * These are designed to emit GraphViz files.
2118 *---------------------
2119 */
2120
2121#ifdef TRGM_REGEXP_DEBUG
2122
2123/*
2124 * Print initial NFA, in regexp library's representation
2125 */
2126static void
2127printSourceNFA(regex_t *regex, TrgmColorInfo *colors, int ncolors)
2128{
2130 int nstates = pg_reg_getnumstates(regex);
2131 int state;
2132 int i;
2133
2135
2136 appendStringInfoString(&buf, "\ndigraph sourceNFA {\n");
2137
2138 for (state = 0; state < nstates; state++)
2139 {
2140 regex_arc_t *arcs;
2141 int i,
2142 arcsCount;
2143
2144 appendStringInfo(&buf, "s%d", state);
2145 if (pg_reg_getfinalstate(regex) == state)
2146 appendStringInfoString(&buf, " [shape = doublecircle]");
2147 appendStringInfoString(&buf, ";\n");
2148
2149 arcsCount = pg_reg_getnumoutarcs(regex, state);
2150 arcs = (regex_arc_t *) palloc(sizeof(regex_arc_t) * arcsCount);
2151 pg_reg_getoutarcs(regex, state, arcs, arcsCount);
2152
2153 for (i = 0; i < arcsCount; i++)
2154 {
2155 appendStringInfo(&buf, " s%d -> s%d [label = \"%d\"];\n",
2156 state, arcs[i].to, arcs[i].co);
2157 }
2158
2159 pfree(arcs);
2160 }
2161
2162 appendStringInfoString(&buf, " node [shape = point ]; initial;\n");
2163 appendStringInfo(&buf, " initial -> s%d;\n",
2164 pg_reg_getinitialstate(regex));
2165
2166 /* Print colors */
2167 appendStringInfoString(&buf, " { rank = sink;\n");
2168 appendStringInfoString(&buf, " Colors [shape = none, margin=0, label=<\n");
2169
2170 for (i = 0; i < ncolors; i++)
2171 {
2172 TrgmColorInfo *color = &colors[i];
2173 int j;
2174
2175 appendStringInfo(&buf, "<br/>Color %d: ", i);
2176 if (color->expandable)
2177 {
2178 for (j = 0; j < color->wordCharsCount; j++)
2179 {
2180 char s[MAX_MULTIBYTE_CHAR_LEN + 1];
2181
2182 memcpy(s, color->wordChars[j].bytes, MAX_MULTIBYTE_CHAR_LEN);
2183 s[MAX_MULTIBYTE_CHAR_LEN] = '\0';
2185 }
2186 }
2187 else
2188 appendStringInfoString(&buf, "not expandable");
2189 appendStringInfoChar(&buf, '\n');
2190 }
2191
2192 appendStringInfoString(&buf, " >];\n");
2193 appendStringInfoString(&buf, " }\n");
2194 appendStringInfoString(&buf, "}\n");
2195
2196 {
2197 /* dot -Tpng -o /tmp/source.png < /tmp/source.gv */
2198 FILE *fp = fopen("/tmp/source.gv", "w");
2199
2200 fprintf(fp, "%s", buf.data);
2201 fclose(fp);
2202 }
2203
2204 pfree(buf.data);
2205}
2206
2207/*
2208 * Print expanded graph.
2209 */
2210static void
2211printTrgmNFA(TrgmNFA *trgmNFA)
2212{
2214 HASH_SEQ_STATUS scan_status;
2216 TrgmState *initstate = NULL;
2217
2219
2220 appendStringInfoString(&buf, "\ndigraph transformedNFA {\n");
2221
2222 hash_seq_init(&scan_status, trgmNFA->states);
2223 while ((state = (TrgmState *) hash_seq_search(&scan_status)) != NULL)
2224 {
2225 ListCell *cell;
2226
2227 appendStringInfo(&buf, "s%d", -state->snumber);
2228 if (state->flags & TSTATE_FIN)
2229 appendStringInfoString(&buf, " [shape = doublecircle]");
2230 if (state->flags & TSTATE_INIT)
2231 initstate = state;
2232 appendStringInfo(&buf, " [label = \"%d\"]", state->stateKey.nstate);
2233 appendStringInfoString(&buf, ";\n");
2234
2235 foreach(cell, state->arcs)
2236 {
2237 TrgmArc *arc = (TrgmArc *) lfirst(cell);
2238
2239 appendStringInfo(&buf, " s%d -> s%d [label = \"",
2240 -state->snumber, -arc->target->snumber);
2241 printTrgmColor(&buf, arc->ctrgm.colors[0]);
2243 printTrgmColor(&buf, arc->ctrgm.colors[1]);
2245 printTrgmColor(&buf, arc->ctrgm.colors[2]);
2246 appendStringInfoString(&buf, "\"];\n");
2247 }
2248 }
2249
2250 if (initstate)
2251 {
2252 appendStringInfoString(&buf, " node [shape = point ]; initial;\n");
2253 appendStringInfo(&buf, " initial -> s%d;\n", -initstate->snumber);
2254 }
2255
2256 appendStringInfoString(&buf, "}\n");
2257
2258 {
2259 /* dot -Tpng -o /tmp/transformed.png < /tmp/transformed.gv */
2260 FILE *fp = fopen("/tmp/transformed.gv", "w");
2261
2262 fprintf(fp, "%s", buf.data);
2263 fclose(fp);
2264 }
2265
2266 pfree(buf.data);
2267}
2268
2269/*
2270 * Print a TrgmColor readably.
2271 */
2272static void
2273printTrgmColor(StringInfo buf, TrgmColor co)
2274{
2275 if (co == COLOR_UNKNOWN)
2277 else if (co == COLOR_BLANK)
2279 else
2280 appendStringInfo(buf, "%d", (int) co);
2281}
2282
2283/*
2284 * Print final packed representation of trigram-based expanded graph.
2285 */
2286static void
2287printTrgmPackedGraph(TrgmPackedGraph *packedGraph, TRGM *trigrams)
2288{
2290 trgm *p;
2291 int i;
2292
2294
2295 appendStringInfoString(&buf, "\ndigraph packedGraph {\n");
2296
2297 for (i = 0; i < packedGraph->statesCount; i++)
2298 {
2299 TrgmPackedState *state = &packedGraph->states[i];
2300 int j;
2301
2302 appendStringInfo(&buf, " s%d", i);
2303 if (i == 1)
2304 appendStringInfoString(&buf, " [shape = doublecircle]");
2305
2306 appendStringInfo(&buf, " [label = <s%d>];\n", i);
2307
2308 for (j = 0; j < state->arcsCount; j++)
2309 {
2310 TrgmPackedArc *arc = &state->arcs[j];
2311
2312 appendStringInfo(&buf, " s%d -> s%d [label = \"trigram %d\"];\n",
2313 i, arc->targetState, arc->colorTrgm);
2314 }
2315 }
2316
2317 appendStringInfoString(&buf, " node [shape = point ]; initial;\n");
2318 appendStringInfo(&buf, " initial -> s%d;\n", 0);
2319
2320 /* Print trigrams */
2321 appendStringInfoString(&buf, " { rank = sink;\n");
2322 appendStringInfoString(&buf, " Trigrams [shape = none, margin=0, label=<\n");
2323
2324 p = GETARR(trigrams);
2325 for (i = 0; i < packedGraph->colorTrigramsCount; i++)
2326 {
2327 int count = packedGraph->colorTrigramGroups[i];
2328 int j;
2329
2330 appendStringInfo(&buf, "<br/>Trigram %d: ", i);
2331
2332 for (j = 0; j < count; j++)
2333 {
2334 if (j > 0)
2336
2337 /*
2338 * XXX This representation is nice only for all-ASCII trigrams.
2339 */
2340 appendStringInfo(&buf, "\"%c%c%c\"", (*p)[0], (*p)[1], (*p)[2]);
2341 p++;
2342 }
2343 }
2344
2345 appendStringInfoString(&buf, " >];\n");
2346 appendStringInfoString(&buf, " }\n");
2347 appendStringInfoString(&buf, "}\n");
2348
2349 {
2350 /* dot -Tpng -o /tmp/packed.png < /tmp/packed.gv */
2351 FILE *fp = fopen("/tmp/packed.gv", "w");
2352
2353 fprintf(fp, "%s", buf.data);
2354 fclose(fp);
2355 }
2356
2357 pfree(buf.data);
2358}
2359
2360#endif /* TRGM_REGEXP_DEBUG */
int64_t int64
Definition: c.h:499
float float4
Definition: c.h:600
#define MemSet(start, val, len)
Definition: c.h:991
#define fprintf(file, fmt, msg)
Definition: cubescan.l:21
void * hash_search(HTAB *hashp, const void *keyPtr, HASHACTION action, bool *foundPtr)
Definition: dynahash.c:955
void * hash_seq_search(HASH_SEQ_STATUS *status)
Definition: dynahash.c:1420
long hash_get_num_entries(HTAB *hashp)
Definition: dynahash.c:1341
HTAB * hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
Definition: dynahash.c:352
void hash_seq_init(HASH_SEQ_STATUS *status, HTAB *hashp)
Definition: dynahash.c:1385
int errcode(int sqlerrcode)
Definition: elog.c:854
int errmsg(const char *fmt,...)
Definition: elog.c:1071
#define ERROR
Definition: elog.h:39
#define ereport(elevel,...)
Definition: elog.h:149
char * str_tolower(const char *buff, size_t nbytes, Oid collid)
Definition: formatting.c:1637
Assert(PointerIsAligned(start, uint64))
const char * str
static const FormData_pg_attribute a1
Definition: heap.c:144
static const FormData_pg_attribute a2
Definition: heap.c:157
@ HASH_ENTER
Definition: hsearch.h:114
#define HASH_CONTEXT
Definition: hsearch.h:102
#define HASH_ELEM
Definition: hsearch.h:95
#define HASH_BLOBS
Definition: hsearch.h:97
#define CALCGTSIZE(flag, siglen)
Definition: hstore_gist.c:60
int j
Definition: isn.c:78
int i
Definition: isn.c:77
List * lappend(List *list, void *datum)
Definition: list.c:339
List * list_concat(List *list1, const List *list2)
Definition: list.c:561
void list_free(List *list)
Definition: list.c:1546
unsigned int pg_wchar
Definition: mbprint.c:31
int pg_wchar2mb_with_len(const pg_wchar *from, char *to, int len)
Definition: mbutils.c:1008
int pg_mb2wchar_with_len(const char *from, pg_wchar *to, int len)
Definition: mbutils.c:986
void * MemoryContextAlloc(MemoryContext context, Size size)
Definition: mcxt.c:1260
void * MemoryContextAllocZero(MemoryContext context, Size size)
Definition: mcxt.c:1294
void pfree(void *pointer)
Definition: mcxt.c:2150
void * palloc0(Size size)
Definition: mcxt.c:1973
void * palloc(Size size)
Definition: mcxt.c:1943
MemoryContext CurrentMemoryContext
Definition: mcxt.c:159
void MemoryContextDelete(MemoryContext context)
Definition: mcxt.c:485
#define AllocSetContextCreate
Definition: memutils.h:149
#define ALLOCSET_DEFAULT_SIZES
Definition: memutils.h:180
static MemoryContext MemoryContextSwitchTo(MemoryContext context)
Definition: palloc.h:124
#define lfirst(lc)
Definition: pg_list.h:172
static int list_length(const List *l)
Definition: pg_list.h:152
#define NIL
Definition: pg_list.h:68
#define foreach_delete_current(lst, var_or_cell)
Definition: pg_list.h:391
#define list_make1(x1)
Definition: pg_list.h:212
static rewind_source * source
Definition: pg_rewind.c:89
static char * buf
Definition: pg_test_fsync.c:72
#define MAX_MULTIBYTE_CHAR_LEN
Definition: pg_wchar.h:33
#define qsort(a, b, c, d)
Definition: port.h:479
unsigned int Oid
Definition: postgres_ext.h:30
char * c
int pg_regcomp(regex_t *re, const chr *string, size_t len, int flags, Oid collation)
Definition: regcomp.c:372
size_t pg_regerror(int errcode, const regex_t *preg, char *errbuf, size_t errbuf_size)
Definition: regerror.c:60
#define REG_ICASE
Definition: regex.h:184
#define REG_ADVANCED
Definition: regex.h:181
#define REG_OKAY
Definition: regex.h:215
#define REG_NOSUB
Definition: regex.h:185
#define regex_t
Definition: regex.h:245
int pg_reg_getnumoutarcs(const regex_t *regex, int st)
Definition: regexport.c:134
int pg_reg_getnumcolors(const regex_t *regex)
Definition: regexport.c:174
int pg_reg_getnumstates(const regex_t *regex)
Definition: regexport.c:36
int pg_reg_getfinalstate(const regex_t *regex)
Definition: regexport.c:64
int pg_reg_getinitialstate(const regex_t *regex)
Definition: regexport.c:50
void pg_reg_getoutarcs(const regex_t *regex, int st, regex_arc_t *arcs, int arcs_len)
Definition: regexport.c:155
int pg_reg_colorisend(const regex_t *regex, int co)
Definition: regexport.c:208
void pg_reg_getcharacters(const regex_t *regex, int co, pg_wchar *chars, int chars_len)
Definition: regexport.c:266
int pg_reg_getnumcharacters(const regex_t *regex, int co)
Definition: regexport.c:230
int pg_reg_colorisbegin(const regex_t *regex, int co)
Definition: regexport.c:191
short color
Definition: regguts.h:155
void appendStringInfo(StringInfo str, const char *fmt,...)
Definition: stringinfo.c:145
void appendStringInfoString(StringInfo str, const char *s)
Definition: stringinfo.c:230
void appendStringInfoChar(StringInfo str, char ch)
Definition: stringinfo.c:242
void initStringInfo(StringInfo str)
Definition: stringinfo.c:97
ColorTrgm ctrgm
Definition: trgm_regexp.c:377
float4 penalty
Definition: trgm_regexp.c:380
TrgmColor colors[3]
Definition: trgm_regexp.c:302
Size keysize
Definition: hsearch.h:75
Size entrysize
Definition: hsearch.h:76
MemoryContext hcxt
Definition: hsearch.h:86
Definition: dynahash.c:220
Definition: pg_list.h:54
Definition: trgm.h:58
uint8 flag
Definition: trgm.h:60
TrgmState * source
Definition: trgm_regexp.c:362
TrgmState * target
Definition: trgm_regexp.c:363
ColorTrgm ctrgm
Definition: trgm_regexp.c:351
TrgmState * target
Definition: trgm_regexp.c:352
trgm_mb_char * wordChars
Definition: trgm_regexp.c:266
bool containsNonWord
Definition: trgm_regexp.c:264
bool overflowed
Definition: trgm_regexp.c:418
int nstates
Definition: trgm_regexp.c:412
int arcsCount
Definition: trgm_regexp.c:417
ColorTrgmInfo * colorTrgms
Definition: trgm_regexp.c:421
List * queue
Definition: trgm_regexp.c:415
int colorTrgmsCount
Definition: trgm_regexp.c:422
TrgmColorInfo * colorInfo
Definition: trgm_regexp.c:406
int totalTrgmCount
Definition: trgm_regexp.c:423
List * keysQueue
Definition: trgm_regexp.c:416
int ncolors
Definition: trgm_regexp.c:407
TrgmState * initState
Definition: trgm_regexp.c:411
regex_t * regex
Definition: trgm_regexp.c:405
HTAB * states
Definition: trgm_regexp.c:410
TrgmPackedState * states
Definition: trgm_regexp.c:461
bool * statesActive
Definition: trgm_regexp.c:465
bool * colorTrigramsActive
Definition: trgm_regexp.c:464
int * colorTrigramGroups
Definition: trgm_regexp.c:454
TrgmPackedArc * arcs
Definition: trgm_regexp.c:438
TrgmColor colors[2]
Definition: trgm_regexp.c:293
TrgmPrefix prefix
Definition: trgm_regexp.c:312
List * arcs
Definition: trgm_regexp.c:337
List * enterKeys
Definition: trgm_regexp.c:338
struct TrgmState * parent
Definition: trgm_regexp.c:341
TrgmStateKey stateKey
Definition: trgm_regexp.c:336
int tentFlags
Definition: trgm_regexp.c:342
struct TrgmState * tentParent
Definition: trgm_regexp.c:343
Definition: regguts.h:296
color co
Definition: regguts.h:298
struct state * to
Definition: regguts.h:300
Definition: regguts.h:323
char bytes[MAX_MULTIBYTE_CHAR_LEN]
Definition: trgm_regexp.c:245
Definition: c.h:658
#define ISWORDCHR(c)
Definition: trgm.h:50
void compact_trigram(trgm *tptr, char *str, int bytelen)
Definition: trgm_op.c:248
char trgm[3]
Definition: trgm.h:41
#define GETARR(x)
Definition: trgm.h:97
#define ARRKEY
Definition: trgm.h:87
#define TRGMHDRSIZE
Definition: trgm.h:64
int TrgmColor
Definition: trgm_regexp.c:285
static bool convertPgWchar(pg_wchar c, trgm_mb_char *result)
Definition: trgm_regexp.c:829
static bool validArcLabel(TrgmStateKey *key, TrgmColor co)
Definition: trgm_regexp.c:1330
static TrgmState * getState(TrgmNFA *trgmNFA, TrgmStateKey *key)
Definition: trgm_regexp.c:1385
static void getColorInfo(regex_t *regex, TrgmNFA *trgmNFA)
Definition: trgm_regexp.c:765
#define COLOR_BLANK
Definition: trgm_regexp.c:289
static TRGM * createTrgmNFAInternal(regex_t *regex, TrgmPackedGraph **graph, MemoryContext rcontext)
Definition: trgm_regexp.c:567
#define TSTATE_INIT
Definition: trgm_regexp.c:331
static void addArc(TrgmNFA *trgmNFA, TrgmState *state, TrgmStateKey *key, TrgmColor co, TrgmStateKey *destKey)
Definition: trgm_regexp.c:1288
#define MAX_EXPANDED_STATES
Definition: trgm_regexp.c:221
static int colorTrgmInfoPenaltyCmp(const void *p1, const void *p2)
Definition: trgm_regexp.c:1907
#define MAX_TRGM_COUNT
Definition: trgm_regexp.c:223
#define MAX_EXPANDED_ARCS
Definition: trgm_regexp.c:222
struct TrgmState TrgmState
static bool selectColorTrigrams(TrgmNFA *trgmNFA)
Definition: trgm_regexp.c:1458
static void fillTrgm(trgm *ptrgm, trgm_mb_char s[3])
Definition: trgm_regexp.c:1845
static TRGM * expandColorTrigrams(TrgmNFA *trgmNFA, MemoryContext rcontext)
Definition: trgm_regexp.c:1777
TRGM * createTrgmNFA(text *text_re, Oid collation, TrgmPackedGraph **graph, MemoryContext rcontext)
Definition: trgm_regexp.c:524
static void transformGraph(TrgmNFA *trgmNFA)
Definition: trgm_regexp.c:895
bool trigramsMatchGraph(TrgmPackedGraph *graph, bool *check)
Definition: trgm_regexp.c:628
static const float4 penalties[8]
Definition: trgm_regexp.c:231
static void addKeyToQueue(TrgmNFA *trgmNFA, TrgmStateKey *key)
Definition: trgm_regexp.c:1178
static void addKey(TrgmNFA *trgmNFA, TrgmState *state, TrgmStateKey *key)
Definition: trgm_regexp.c:1009
static TrgmPackedGraph * packGraph(TrgmNFA *trgmNFA, MemoryContext rcontext)
Definition: trgm_regexp.c:1932
static int packArcInfoCmp(const void *a1, const void *a2)
Definition: trgm_regexp.c:2093
static void processState(TrgmNFA *trgmNFA, TrgmState *state)
Definition: trgm_regexp.c:959
#define COLOR_COUNT_LIMIT
Definition: trgm_regexp.c:225
#define TSTATE_FIN
Definition: trgm_regexp.c:332
#define WISH_TRGM_PENALTY
Definition: trgm_regexp.c:224
static void mergeStates(TrgmState *state1, TrgmState *state2)
Definition: trgm_regexp.c:1877
#define COLOR_UNKNOWN
Definition: trgm_regexp.c:288
static int colorTrgmInfoCmp(const void *p1, const void *p2)
Definition: trgm_regexp.c:1894
static void RE_compile(regex_t *regex, text *text_re, int cflags, Oid collation)
Definition: trgm_regexp.c:721
static void addArcs(TrgmNFA *trgmNFA, TrgmState *state)
Definition: trgm_regexp.c:1190
static bool prefixContains(TrgmPrefix *prefix1, TrgmPrefix *prefix2)
Definition: trgm_regexp.c:1416
#define VARDATA_ANY(PTR)
Definition: varatt.h:324
#define SET_VARSIZE(PTR, len)
Definition: varatt.h:305
#define VARSIZE_ANY_EXHDR(PTR)
Definition: varatt.h:317
static char chars[TZ_MAX_CHARS]
Definition: zic.c:401