source: trunk/SRC/ToBeReviewed/MATRICE/cmset_op.pro @ 114

Last change on this file since 114 was 114, checked in by smasson, 18 years ago

new compilation options (compile_opt idl2, strictarrsubs) in each routine

  • Property svn:executable set to *
File size: 16.3 KB
Line 
1;+
2; NAME:
3;   CMSET_OP
4;
5; AUTHOR:
6;   Craig B. Markwardt, NASA/GSFC Code 662, Greenbelt, MD 20770
7;   craigm@lheamail.gsfc.nasa.gov
8;
9; PURPOSE:
10;   Performs an AND, OR, or XOR operation between two sets
11;
12; CALLING SEQUENCE:
13;   SET      = CMSET_OP(A, OP, B)
14;
15; DESCRIPTION:
16;
17;   SET_OP performs three common operations between two sets.  The
18;   three supported functions of OP are:
19;
20;        OP          Meaning
21;      'AND' - to find the intersection of A and B;
22;      'OR'  - to find the union of A and B;
23;      'XOR' - to find the those elements who are members of A or B
24;              but not both;
25;
26;   Sets as defined here are one dimensional arrays composed of
27;   numeric or string types.  Comparisons of equality between elements
28;   are done using the IDL EQ operator.
29;
30;   The complements of either set can be taken as well, by using the
31;   NOT1 and NOT2 keywords.  For example, it may be desireable to find
32;   the elements in A but not B, or B but not A (they are different!).
33;   The following IDL expressions achieve each of those effects:
34;
35;      SET = CMSET_OP(A, 'AND', /NOT2, B)   ; A but not B
36;      SET = CMSET_OP(/NOT1, A, 'AND', B)   ; B but not A
37;
38;   Note the distinction between NOT1 and NOT2.  NOT1 refers to the
39;   first set (A) and NOT2 refers to the second (B).  Their ordered
40;   placement in the calling sequence is entirely optional, but the
41;   above ordering makes the logical meaning explicit.
42;
43;   NOT1 and NOT2 can only be set for the 'AND' operator, and never
44;   simultaneously.  This is because the results of an operation with
45;   'OR' or 'XOR' and any combination of NOTs -- or with 'AND' and
46;   both NOTs -- formally cannot produce a defined result.
47;
48;   The implementation depends on the type of operands.  For integer
49;   types, a fast technique using HISTOGRAM is used.  However, this
50;   algorithm becomes inefficient when the dynamic range in the data
51;   is large.  For those cases, and for other data types, a technique
52;   based on SORT() is used.  Thus the compute time should scale
53;   roughly as (A+B)*ALOG(A+B) or better, rather than (A*B) for the
54;   brute force approach.  For large arrays this is a significant
55;   benefit.
56;
57; INPUTS:
58;
59;   A, B - the two sets to be operated on.  A one dimensional array of
60;          either numeric or string type.  A and B must be of the same
61;          type.  Empty sets are permitted, and are either represented
62;          as an undefined variable, or by setting EMPTY1 or EMPTY2.
63;
64;   OP - a string, the operation to be performed.  Must be one of
65;        'AND', 'OR' or 'XOR' (lower or mixed case is permitted).
66;        Other operations will cause an error message to be produced.
67;
68; KEYWORDS:
69;
70;   NOT1, NOT2 - if set and OP is 'AND', then the complement of A (for
71;                NOT1) or B (for NOT2) will be used in the operation.
72;                NOT1 and NOT2 cannot be set simultaneously.
73;
74;   EMPTY1, EMPTY2 - if set, then A (for EMPTY1) or B (for EMPTY2) are
75;                    assumed to be the empty set.  The actual values
76;                    passed as A or B are then ignored.
77;
78;   INDEX - if set, then return a list of indices instead of the array
79;           values themselves.  The "slower" set operations are always
80;           performed in this case.
81;
82;           The indices refer to the *combined* array [A,B].  To
83;           clarify, in the following call: I = CMSET_OP(..., /INDEX);
84;           returned values from 0 to NA-1 refer to A[I], and values
85;           from NA to NA+NB-1 refer to B[I-NA].
86;
87;   COUNT - upon return, the number of elements in the result set.
88;           This is only important when the result set is the empty
89;           set, in which case COUNT is set to zero.
90;
91; RETURNS:
92;
93;   The resulting set as a one-dimensional array.  The set may be
94;   represented by either an array of data values (default), or an
95;   array of indices (if INDEX is set).  Duplicate elements, if any,
96;   are removed, and element order may not be preserved.
97;
98;   The empty set is represented as a return value of -1L, and COUNT
99;   is set to zero.  Note that the only way to recognize the empty set
100;   is to examine COUNT.
101;
102; SEE ALSO:
103;
104;   SET_UTILS.PRO by RSI
105;
106; MODIFICATION HISTORY:
107;   Written, CM, 23 Feb 2000
108;   Added empty set capability, CM, 25 Feb 2000
109;   Documentation clarification, CM 02 Mar 2000
110;   Incompatible but more consistent reworking of EMPTY keywords, CM,
111;     04 Mar 2000
112;   Minor documentation clarifications, CM, 26 Mar 2000
113;   Corrected bug in empty_arg special case, CM 06 Apr 2000
114;   Add INDEX keyword, CM 31 Jul 2000
115;   Clarify INDEX keyword documentation, CM 06 Sep 2000
116;   Made INDEX keyword always force SLOW_SET_OP, CM 06 Sep 2000
117;   Added CMSET_OP_UNIQ, and ability to select FIRST_UNIQUE or
118;     LAST_UNIQUE values, CM, 18 Sep 2000
119;   Removed FIRST_UNIQUE and LAST_UNIQUE, and streamlined
120;     CMSET_OP_UNIQ until problems with SORT can be understood, CM, 20
121;     Sep 2000 (thanks to Ben Tupper)
122;   Still trying to get documentation of INDEX and NOT right, CM, 28
123;     Sep 2000 (no code changes)
124;   Correct bug for AND case, when input sets A and B each only have
125;     one unique value, and the values are equal.  CM, 04 Mar 2004
126;     (thanks to James B. jbattat at cfa dot harvard dot edu)
127;   Add support for the cases where the input data types are mixed,
128;      but still compatible; also, attempt to return the same data
129;      type that was passed in; CM, 05 Feb 2005
130;   Fix bug in type checking (thanks to "marit"), CM, 10 Dec 2005
131;   Work around a stupidity in the built-in IDL HISTOGRAM routine,
132;      which tries to "help" you by restricting the MIN/MAX to the
133;      range of the input variable (thanks to Will Maddox), CM, 16 Jan 2006
134;
135;  $Id: cmset_op.pro,v 1.6 2006/01/16 19:45:22 craigm Exp $
136;
137;-
138; Copyright (C) 2000, 2004, 2005, 2006, Craig Markwardt
139; This software is provided as is without any warranty whatsoever.
140; Permission to use, copy, modify, and distribute modified or
141; unmodified copies is granted, provided this copyright and disclaimer
142; are included unchanged.
143;-
144
145;; Utility function, similar to UNIQ, but allowing choice of taking
146;; first or last unique element, or non-unique elements.
147;; Unfortunately this doesn't work because of implementation dependent
148;; versions of the SORT() function.
149
150; function cmset_op_uniq, a, first=first, non=non, count=ct, sort=sortit
151;   if n_elements(a) LE 1 then return, 0L
152;   sh = (2L*keyword_set(first)-1L)*(-2L*keyword_set(non)+1)
153;
154;   if keyword_set(sortit) then begin
155;       ;; Sort it manually
156;       ii = sort(a) & b = a[ii]
157;       if keyword_set(non) then wh = where(b EQ shift(b, sh), ct) $
158;       else                     wh = where(b NE shift(b, sh), ct)
159;       if ct GT 0 then return, ii[wh]
160;   endif else begin
161;       ;; Use the user's values directly
162;       if keyword_set(non) then wh = where(a EQ shift(a, sh), ct) $
163;       else                     wh = where(a NE shift(a, sh), ct)
164;       if ct GT 0 then return, wh
165;   endelse
166;
167;   if keyword_set(first) then return, 0L else return, n_elements(a)-1
168; end
169
170;; Simplified version of CMSET_OP_UNIQ which sorts, and takes the
171;; "first" value, whatever that may mean.
172function cmset_op_uniq, a
173;
174  compile_opt idl2, strictarrsubs
175;
176  if n_elements(a) LE 1 then return, 0L
177
178  ii = sort(a) & b = a[ii]
179  wh = where(b NE shift(b, +1L), ct)
180  if ct GT 0 then return, ii[wh]
181
182  return, 0L
183end
184
185function cmset_op, a, op0, b, not1=not1, not2=not2, count=count, $
186              empty1=empty1, empty2=empty2, maxarray=ma, index=index
187;
188  compile_opt idl2, strictarrsubs
189;
190
191  on_error, 2 ;; return on error
192  count = 0L
193  index0 = -1L
194  ;; Histogram technique is used for array sizes < 32,000 elements
195  if n_elements(ma) EQ 0 then ma = 32L*1024L
196
197  ;; Check the number of arguments
198  if n_params() LT 3 then begin
199      ARG_ERR:
200      message, 'USAGE: SET = CMSET_OP(A, OP, B [, COUNT=ct])', /info
201      message, '  KEYWORDS: /NOT1, /NOT2, /EMPTY1, /EMPTY2, INDEX', /info
202      return, -1L
203  endif
204  if n_elements(op0) EQ 0 then goto, ARG_ERR
205  kind = keyword_set(index)
206  fst = 1L
207  if keyword_set(last) then fst = 0L
208  if keyword_set(first) then fst = 1L
209
210  ;; Check the operation
211  sz = size(op0)
212  if sz[sz[0]+1] NE 7 then begin
213      OP_ERR:
214      message, "ERROR: OP must be 'AND', 'OR' or 'XOR'"
215      return, -1L
216  endif
217  op = strupcase(op0)
218  if op NE 'AND' AND op NE 'OR' AND op NE 'XOR' then goto, OP_ERR
219
220  ;; Check NOT1 and NOT2
221  if keyword_set(not1) AND keyword_set(not2) then begin
222      message, "ERROR: NOT1 and NOT2 cannot be set simultaneously"
223      return, -1L
224  endif
225  if (keyword_set(not1) OR keyword_set(not2)) AND $
226    (op EQ 'OR' OR op EQ 'XOR') then begin
227      message, "ERROR: NOT1 and NOT2 cannot be set with 'OR' or 'XOR'"
228      return, -1L
229  endif
230
231  ;; Special cases for empty set
232  n1 = n_elements(a) & n2 = n_elements(b)
233  if keyword_set(empty1) then n1 = 0L
234  if keyword_set(empty2) then n2 = 0L
235  if n1 EQ 0 OR n2 EQ 0 then begin
236      ;; Eliminate duplicates
237      if n1 GT 0 then a1 = cmset_op_uniq(a)
238      if n2 GT 0 then b1 = cmset_op_uniq(b)
239      n1 = n_elements(a1) < n1 & n2 = n_elements(b1) < n2
240      case op of
241          'OR': if n1 EQ 0 then goto, RET_A1 else goto, RET_B1
242         'XOR': if n1 EQ 0 then goto, RET_B1 else goto, RET_A1
243         'AND': begin
244             if keyword_set(not1) AND n1 EQ 0 then goto, RET_B1
245             if keyword_set(not2) AND n2 EQ 0 then goto, RET_A1
246             return, -1L
247         end
248     endcase
249     return, -1L
250     RET_A1:
251     count = n1
252     if kind then begin
253         if count GT 0 then return, a1 else return, -1L
254     endif
255     if count GT 0 then return, a[a1] else return, -1L
256     RET_B1:
257     count = n2
258     if kind then begin
259         if count GT 0 then return, b1+n1 else return, -1L
260     endif         
261     if count GT 0 then return, b[b1] else return, -1L
262 endif
263
264  ;; Allow data to have different types, but they must be at least of
265  ;; the same "base" type.  That is, you can't combine a number with a
266  ;; string, etc.
267  ;; basetype 0:undefined 1:real number 6:complex number 7:string
268  ;;     8:structure 10:pointer 11:object
269
270  ;;          0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
271  basetype = [0, 1, 1, 1, 1, 1, 6, 7, 8, 6,10,11, 1, 1, 1, 1]
272
273  ;; Check types of operands
274  sz1 = size(a) & tp1 = sz1[sz1[0]+1]
275  sz2 = size(b) & tp2 = sz2[sz2[0]+1]
276  if tp1 LT 0 OR tp1 GE 16 OR tp2 LT 0 OR tp2 GE 16 then begin
277      message, 'ERROR: unrecognized data types for operands'
278      return, -1
279  endif
280  if basetype[tp1] NE basetype[tp2] then begin
281      TYPE1_ERR:
282      message, 'ERROR: both A and B must be of the same type'
283      return, -1L
284  endif
285  if tp1 EQ 8 OR tp1 EQ 10 OR tp1 EQ 11 then begin
286      TYPE2_ERR:
287      message, 'ERROR: operands must be a numeric or string type'
288      return, -1L
289  endif
290
291  ;; Now use two different kinds of algorithms: a slower but more
292  ;; general algorithm for generic types, and the histogram technique
293  ;; for integer types.  Even for integer types, if there is too much
294  ;; dynamic range, then the slow method is used.
295
296  if tp1 GE 4 AND tp1 LE 9 then begin
297      ;; String and real types, or large int arrays
298      SLOW_SET_OP:
299      case op of
300          'OR': begin
301              uu = [a,b]    ;; OR is simple; just take unique values
302              index0 = cmset_op_uniq(uu)
303              count = n_elements(index0)
304              if kind then return, index0
305              return, uu[index0]
306          end
307
308          'XOR': begin
309              ;; Make ordered list of set union
310              ai = cmset_op_uniq(a) & na = n_elements(ai)
311              bi = cmset_op_uniq(b) & nb = n_elements(bi)
312              ui = [ai, bi+n1]
313              uu = [a,b]    & uu = uu[ui] ;; Raw union...
314              us = sort(uu) & uu = uu[us] ;; ...and sort
315              if kind then ui = ui[temporary(us)] else ui = 0
316
317              ;; Values in one set only will not have duplicates
318              wh1 = where(uu NE shift(uu, -1), count1)
319              if count1 EQ 0 then return, -1L
320              wh = where(wh1[1:*]-wh1 EQ 1, count)
321              if wh1[0] EQ 0 then begin
322                  if count GT 0 then wh = [-1L, wh] else wh = [-1L]
323                  count = n_elements(wh)
324              endif
325              if count EQ 0 then return, -1
326              if kind then return, ui[wh1[wh+1]]
327              return, uu[wh1[wh+1]]
328          end
329
330          'AND': begin
331              ;; Make ordered list of set union
332              ai = cmset_op_uniq(a) & na = n_elements(ai)
333              bi = cmset_op_uniq(b) & nb = n_elements(bi)
334              ui = [ai, bi+n1]
335              uu = [a,b]    & uu = uu[ui]  ;; Raw union...
336              us = sort(uu) & uu = uu[us]  ;; ...and sort
337              if kind then ui = ui[us] else ui = 0
338
339              if NOT keyword_set(not1) AND NOT keyword_set(not2) then begin
340
341                  ;; Special case: if there are one in each set, and
342                  ;; they are equal, then the SHIFT() technique below
343                  ;; fails.  Do this one by hand.
344                  if na EQ 1 AND nb EQ 1 AND uu[0] EQ uu[1] then begin
345                      count = 1L
346                      if kind then return, 0L
347                      return, [uu[0]]
348                  endif
349
350                  ;; If neither "NOT" is set, then find duplicates
351                  us = 0L  ;; Save memory
352                  wh = where(uu EQ shift(uu, -1L), count) ;; Find non-unique
353                  if count EQ 0 then return, -1L
354                  ;; This should always select the element from A
355                  ;; rather than B (the smaller of the two)
356                  if kind then return, (ui[wh] < ui[wh+1])
357                  return, uu[wh]
358              endif
359
360              ;; For "NOT" cases, we need to identify by set
361              ii = make_array(na+nb, value=1b)
362              if keyword_set(not1) then ii[0:na-1] = 0
363              if keyword_set(not2) then ii[na:*]   = 0
364              ii = ii[temporary(us)]
365
366              ;; Remove any duplicates
367              wh1 = where(uu EQ shift(uu, -1L), count1) ;; Find non-unique
368              if count1 GT 0 then ii[wh1, wh1+1] = 0
369              ;; Remainder is the desired set
370              wh = where(ii, count)
371              if count EQ 0 then return, -1L
372              if kind then return, ui[wh]
373              return, uu[wh]
374          end
375
376      endcase
377      return, -1L  ;; DEFAULT CASE
378  endif else begin
379
380      ;; INDEX keyword forces the "slow" operation
381      if kind then goto, SLOW_SET_OP
382
383      ;; Integer types - use histogram technique if the data range
384      ;; is small enough, otherwise use the "slow" technique above
385      min1 = min(a, max=max1) & min2 = min(b, max=max2)
386      minn = min1 < min2 & maxx = max1 > max2
387      nbins = maxx-minn+1
388      if (maxx-minn) GT floor(ma[0]) then goto, SLOW_SET_OP
389
390      ;; Work around a stupidity in the built-in IDL HISTOGRAM routine
391      if (tp1 EQ 2 OR tp2 EQ 2) AND (minn LT -32768 OR maxx GT 32767) then $
392        goto, SLOW_SET_OP
393
394      ;; Following operations create a histogram of the integer values.
395      ha = histogram(a, min=minn, max=maxx) < 1
396      hb = histogram(b, min=minn, max=maxx) < 1
397
398      ;; Compute NOT cases
399      if keyword_set(not1) then ha = 1b - ha 
400      if keyword_set(not2) then hb = 1b - hb
401      case op of
402          ;; Boolean operations
403          'AND': mask = temporary(ha) AND temporary(hb)
404           'OR': mask = temporary(ha)  OR temporary(hb)
405          'XOR': mask = temporary(ha) XOR temporary(hb)
406      endcase
407
408      wh = where(temporary(mask), count)
409      if count EQ 0 then return, -1L
410     
411      result = temporary(wh+minn)
412      if tp1 NE tp2 then return, result
413      szr = size(result) & tpr = szr[szr[0]+1]
414
415      ;; Cast to the original type if necessary
416      if tpr NE tp1 then begin
417          fresult = make_array(n_elements(result), type=tp1)
418          fresult[0] = temporary(result)
419          result = temporary(fresult)
420      endif
421
422      return, result
423
424  endelse
425
426  return, -1L  ;; DEFAULT CASE
427end
428
429;     Here is how I did the INDEX stuff with fast histogramming.  It
430;     works, but is complicated, so I forced it to go to SLOW_SET_OP.
431;     ha = histogram(a, min=minn, max=maxx, reverse=ra) < 1
432;     rr = ra[0:nbins] & mask = rr NE rr[1:*] & ra = ra[rr]*mask-1L+mask
433;     hb = histogram(b, min=minn, max=maxx, reverse=rb) < 1
434;     rr = rb[0:nbins] & mask = rr NE rr[1:*] & rb = rb[rr]*mask-1L+mask
435;     ...  AND/OR/XOR NOT masking here ...
436;     ra = ra[wh] & rb = rb[wh]
437;     return, ra*(ra GE 0) + (rb+n1)*(ra LT 0) ;; is last 'ra' right?
438
Note: See TracBrowser for help on using the repository browser.