You are right that I want to NODUPKEY by identifier - that is what I do for now.
I can indeed skip a table (it can also be the first one: just sort it and keep it), but I do not know which approach is best in terms of time.
Because your source data is in a DBMS, it might be most efficient to perform the processing there. You can write your own SQL query and submit it using PROC SQL's explicit pass-through capability. If you're not quite sure how to write that query, you can look to PROC SORT for an example. When the NODUPKEY option is specified, PROC SORT can generate and submit SQL to a DBMS to perform the work. Obviously, the sorting BY identifier DESCENDING declaration with NODUPKEY isn't what you want but the SQL it generates can serve as your starting point. The flavor of SQL generated depends upon the DBMS being used. You can print the generated SQL to the log by specifying the SQL_IP_TRACE=SOURCE option.
Here's an example:
libname dbms ... ;  /* Modify as appropriate for your DBMS */

proc delete data=dbms.have; run;

data dbms.have;
   length identifier declaration $ 10;
   input identifier declaration;
   cards;
Y 23456C
Y 23456B
Y 23456A
X 123456A
X 123456B
;
run;

options msglevel=i;
options sql_ip_trace=source;

/***************************************************
 Use the SQL generated here as a starting point
***************************************************/
proc sort data=dbms.have out=have2 nodupkey;
   by identifier descending declaration;
run;

proc print data=have2; run;

/***************************************************
 For example, we take the generated SQL and submit
 it to the DBMS using explicit pass-through. This
 does not quite do what we want, though.
***************************************************/
proc sql nowarn;
   drop table have2;
   connect using dbms as db;
   create table have2 as select * from connection to db (
      WITH "subquery0" ( "declaration", "identifier" ) AS (
         SELECT "declaration" AS "declaration", "identifier" AS "identifier"
         FROM "have"
      )
      SELECT "table0"."identifier", "table0"."declaration"
      FROM (
         SELECT "declaration", "identifier"
         FROM (
            SELECT "declaration", "identifier",
                   ROW_NUMBER() OVER (
                      PARTITION BY "identifier", "declaration"
                      ORDER BY "identifier", "declaration" DESC
                   ) AS "tempcol0"
            FROM "subquery0"
         ) AS "subquery1"
         WHERE ( "tempcol0" = 1 )
      ) AS "table0"
      ORDER BY "table0"."identifier", "table0"."declaration" DESC
   );
   disconnect from db;
quit;

proc print data=have2; run;

/***************************************************
 But we can modify the SQL, removing the declaration
 column from the PARTITION BY clause, which should
 produce what we want
***************************************************/
proc sql nowarn;
   drop table have2;
   connect using dbms as db;
   create table have2 as select * from connection to db (
      WITH "subquery0" ( "declaration", "identifier" ) AS (
         SELECT "declaration" AS "declaration", "identifier" AS "identifier"
         FROM "have"
      )
      SELECT "table0"."identifier", "table0"."declaration"
      FROM (
         SELECT "declaration", "identifier"
         FROM (
            SELECT "declaration", "identifier",
                   ROW_NUMBER() OVER (
                      PARTITION BY "identifier" /*, "declaration" */
                      ORDER BY "identifier", "declaration" DESC
                   ) AS "tempcol0"
            FROM "subquery0"
         ) AS "subquery1"
         WHERE ( "tempcol0" = 1 )
      ) AS "table0"
      ORDER BY "table0"."identifier", "table0"."declaration" DESC
   );
   disconnect from db;
quit;

proc print data=have2; run;
The last PROC PRINT produces:
Obs    identifier    declaration
  1    X             123456B
  2    Y             23456C
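For readers who want to experiment with this keep-the-last-row logic without a DBMS connection, here is a hypothetical illustration using SQLite's ROW_NUMBER() from Python. This is not the SQL your DBMS would generate; the table and column names are simply taken from the example above.

```python
import sqlite3

# Hypothetical illustration: the "keep the highest declaration per
# identifier" logic from the pass-through SQL, shown with SQLite.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE have (identifier TEXT, declaration TEXT);
INSERT INTO have VALUES
  ('Y','23456C'), ('Y','23456B'), ('Y','23456A'),
  ('X','123456A'), ('X','123456B');
""")

rows = con.execute("""
SELECT identifier, declaration FROM (
  SELECT identifier, declaration,
         ROW_NUMBER() OVER (
           PARTITION BY identifier        -- identifier only, not declaration
           ORDER BY declaration DESC
         ) AS rn
  FROM have
) WHERE rn = 1
ORDER BY identifier
""").fetchall()
print(rows)  # [('X', '123456B'), ('Y', '23456C')]
```

This produces the same two rows as the PROC PRINT above.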
Just going back to the environment: %put _all_ includes %put _automatic_; it will show the automatic system variables. This is UE on Windows.
SYSSCP and SYSSCPL will tell you the OS your SAS session is running on. As you say it is a sas7bdat file, I am convinced it is all Windows desktop based. You are not using Enterprise Guide but the old classic DMS.
All your processing will be done by your desktop/laptop. The 2GB MEMSIZE sizing is rather small; maybe a 32-bit OS and SAS, or a VDI, is being used.
This sizing normally belongs to a default server session setting.
43 %put _automatic_ ;
AUTOMATIC AFDSID 0
AUTOMATIC AFDSNAME
AUTOMATIC AFLIB
AUTOMATIC AFSTR1
AUTOMATIC AFSTR2
AUTOMATIC FSPBDV
AUTOMATIC SYSADDRBITS 64
AUTOMATIC SYSBUFFR
AUTOMATIC SYSCC 0
AUTOMATIC SYSCHARWIDTH 1
AUTOMATIC SYSCMD
AUTOMATIC SYSDATASTEPPHASE
AUTOMATIC SYSDATE 08OCT14
AUTOMATIC SYSDATE9 08OCT2014
AUTOMATIC SYSDAY Wednesday
AUTOMATIC SYSDEVIC
AUTOMATIC SYSDMG 0
AUTOMATIC SYSDSN _NULL_
AUTOMATIC SYSENCODING utf-8
AUTOMATIC SYSENDIAN LITTLE
AUTOMATIC SYSENV BACK
AUTOMATIC SYSERR 0
AUTOMATIC SYSERRORTEXT Shell escape is not valid in this SAS session.
AUTOMATIC SYSFILRC 1
AUTOMATIC SYSHOSTINFOLONG Linux LIN X64 2.6.32-431.11.2.el6.x86_64 #1 SMP Tue Mar 25 19:59:55 UTC 2014 x86_64 CentOS release 6.5 (Final)
AUTOMATIC SYSHOSTNAME localhost
AUTOMATIC SYSINDEX 0
AUTOMATIC SYSINFO 0
AUTOMATIC SYSJOBID 24800
AUTOMATIC SYSLAST _NULL_
AUTOMATIC SYSLCKRC 0
AUTOMATIC SYSLIBRC 0
AUTOMATIC SYSLOGAPPLNAME
AUTOMATIC SYSMACRONAME
AUTOMATIC SYSMAXLONG 9007199254740992
AUTOMATIC SYSMENV S
AUTOMATIC SYSMSG
AUTOMATIC SYSNCPU 2
AUTOMATIC SYSNOBS 58
AUTOMATIC SYSODSESCAPECHAR
AUTOMATIC SYSODSGRAPHICS 1
AUTOMATIC SYSODSPATH WORK.TEMPLAT(UPDATE) SASUSER.TEMPLAT(READ) SASHELP.TMPLMST(READ)
AUTOMATIC SYSPARM
AUTOMATIC SYSPBUFF
AUTOMATIC SYSPROCESSID 41D9C14253A7EC464018000000000000
AUTOMATIC SYSPROCESSMODE SAS Workspace Server
AUTOMATIC SYSPROCESSNAME Object Server
AUTOMATIC SYSPROCNAME
AUTOMATIC SYSRC 0
AUTOMATIC SYSSCP LIN X64
AUTOMATIC SYSSCPL Linux
AUTOMATIC SYSSITE 70068118
AUTOMATIC SYSSIZEOFLONG 8
AUTOMATIC SYSSIZEOFPTR 8
AUTOMATIC SYSSIZEOFUNICODE 4
AUTOMATIC SYSSTARTID
AUTOMATIC SYSSTARTNAME
AUTOMATIC SYSTCPIPHOSTNAME localhost.localdomain
AUTOMATIC SYSTIME 10:28
AUTOMATIC SYSTIMEZONE
AUTOMATIC SYSTIMEZONEIDENT
AUTOMATIC SYSTIMEZONEOFFSET -14400
AUTOMATIC SYSUSERID sasdemo
AUTOMATIC SYSVER 9.4
AUTOMATIC SYSVLONG 9.04.01M1P120413
AUTOMATIC SYSVLONG4 9.04.01M1P12042013
AUTOMATIC SYSWARNINGTEXT
44
I have taken Patrick's test set and modified it so that it has two variables; I added/changed the value obspnt.
You can use it that way once you have done the selection relating to the original record, even using the point= method for accessing that one.
I added the log/FULLSTIMER output using UE. It is about 13M records with a dataset size of 500MB (ca. 15s).
The goal is getting the last of the duplicates back.
The first solution solves it, as commented, by using the knowledge that declarations are clustered and ordered.
BY processing with NOTSORTED will do, and runs in about 15 seconds.
When it is not sorted, a sort step can solve that, at the cost of some additional time.
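A hypothetical sketch (in Python, just to illustrate the logic, not SAS code) of what BY identifier NOTSORTED with `if last.identifier;` does on clustered data: a row is kept only when the next row starts a new group, or when there is no next row.

```python
# Data is grouped (clustered) by identifier but not sorted by it.
have = [("Y", "23456A"), ("Y", "23456B"), ("Y", "23456C"),
        ("X", "123456A"), ("X", "123456B")]

# Keep a row only when the following row belongs to a different group
# (or there is no following row) -- the "last.identifier" condition.
want = [row for row, nxt in zip(have, have[1:] + [None])
        if nxt is None or nxt[0] != row[0]]
print(want)  # [('Y', '23456C'), ('X', '123456B')]
```

Note that the output keeps the groups in their original (unsorted) order, which is why a sort is still needed if the result must be ordered.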
The second solution solves it with a hash. Loading the hash will order/sort the data, so with this approach no separate sorting step is needed.
Then, iterating backwards, a second hash is filled with the last occurrence of each 14-character key. This runs in about 1m20s.
I planned to remove key items, but got a locking error because the iterator position was locking the item.
As shown, the DATA step's own processing is not used; you could do some of that using those hashes in memory. There is 1.5GB used.
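A hypothetical Python sketch of the hash approach's logic (the identifiers and record numbers are made up): order by declaration, walk backwards from the end, and keep the first row seen for each 14-character identifier prefix, which is the last declaration of that group.

```python
# declaration = 14-char identifier + one letter; values are record numbers
# (like obspnt). Made-up data for illustration only.
have = {"00000000000001A": 1, "00000000000001B": 2,
        "00000000000002A": 3, "00000000000002B": 4, "00000000000002C": 5}

want, prev = {}, None
for decl in sorted(have, reverse=True):   # like h1p.last() .. h1p.prev()
    ident = decl[:14]                     # the identifier prefix
    if ident != prev:                     # first hit walking backwards
        want[decl] = have[decl]           # = last declaration of the group
    prev = ident
print(want)  # {'00000000000002C': 5, '00000000000001B': 2}
```

The "first hit walking backwards" test plays the same role as the last. condition in the BY-group solution.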
I made a mistake on the test dataset: with 10 times more records (135M), durations were 10 minutes for creating the data and 5 minutes for the last. processing. The hash version could not run.
The more that is known of your processing and requirements, the better the choices can be tailored to that.
Hundreds of programs/datasets? That sounds like spending some effort on this is worth it.
44 data test.have(keep=identifier declaration obspnt);
45 attrib identifier length=$14 declaration length=$15.;
46 retain obspnt 0;
47 do i=1 to 1000000; /* no of records */
48 _value=ceil(ranuni(1)*12000000000);
49 identifier=putn(_value,'z14.0');
50 _stop=ceil(ranuni(2)*26 ); /* 26 letters at max, ascii letters latin-1 start at 40x */
51 do _i=1 to _stop;
52 declaration=cats(identifier,byte(64+_i));
53 obspnt+1;
54 output;
55 end;
56 end;
57 run;
NOTE: The data set TEST.HAVE has 13482024 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 14.21 seconds
user cpu time 0.00 seconds
system cpu time 15.37 seconds
memory 379.96k
OS Memory 25236.00k
Timestamp 08-10-2014 04:40:07 PM
Step Count 42 Switch Count 74
Page Faults 0
Page Reclaims 68
Page Swaps 0
Voluntary Context Switches 272
Involuntary Context Switches 90
Block Input Operations 0
Block Output Operations 1058064
58
59 /* unique identifier unsorted, but declaration within that is clustered and ordered; want the last */
60 data test.want1 ;
61 set test.have ;
62 by identifier notsorted ;
63 if last.identifier;
64 run;
NOTE: There were 13482024 observations read from the data set TEST.HAVE.
NOTE: The data set TEST.WANT1 has 1000000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 15.41 seconds
user cpu time 0.00 seconds
system cpu time 15.96 seconds
memory 562.09k
OS Memory 25236.00k
Timestamp 08-10-2014 04:40:23 PM
Step Count 43 Switch Count 40
Page Faults 0
Page Reclaims 39
Page Swaps 0
Voluntary Context Switches 105
Involuntary Context Switches 245
Block Input Operations 0
Block Output Operations 78992
65
66 /* using just declaration to build a hash with the highest letter value */
67 data notwant (drop=_:);
68 set sashelp.class end=last;
69
70 if _n_=1 then do;
71 length _declprvh _declprvt $14 ;
72 if 0 then set test.have(keep=declaration obspnt rename=(declaration=_decl));
73 dcl hash h1 (dataset:'test.have(keep=declaration obspnt rename=(declaration=_decl) obs=max)', duplicate:'r', ordered:'yes', hashexp:20);
74 dcl hiter h1p('h1');
75 _rc=h1.defineKey('_decl');
76 _rc=h1.defineData('_decl','obspnt');
77 _rc=h1.defineDone();
78 dcl hash h2 ( ordered: 'yes', duplicate:'r', hashexp:20);
79 dcl hiter h2p('h2');
80 _rc=h2.defineKey('_decl');
81 _rc=h2.defineData('_decl','obspnt');
82 _rc=h2.defineDone();
83 call missing(_decl,obspnt);
84
85 _rcp=h1p.last();
86 do while ( not _rcp) ;
87 _declprvt=substr(_decl,1,14);
88 If _declprvt not = _declprvh then _rc=h2.add( );
89 _declprvh=_declprvt;
90 _rcp=h1p.prev();
91 end;
92 /* Create output data set from hash object */
93 _rc = h2.output(dataset:'test.want2(rename=(_decl=declaration))');
94 end;
95
96 /* your program code on class dataset */
97 run;
NOTE: There were 13482024 observations read from the data set TEST.HAVE.
WARNING: Hash Object DATASET option should be used when specifying the DUPLICATE option.
NOTE: The data set TEST.WANT2 has 1000000 observations and 2 variables.
NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.NOTWANT has 19 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 1:21.50
user cpu time 30.95 seconds
system cpu time 47.99 seconds
memory 1523263.51k
OS Memory 1547520.00k
Timestamp 08-10-2014 04:41:44 PM
Step Count 44 Switch Count 154
Page Faults 0
Page Reclaims 98625
Page Swaps 0
Voluntary Context Switches 699
Involuntary Context Switches 1111
Block Input Operations 0
Block Output Operations 47656
I now realize I should have said it before, but I have to order the table because I then merge it with another table by this identifier.
So as I understand your message (I do my best, but it is a bit technical for me), the second method does that and the first does not, so if I add a PROC SORT to the first method, will the second one be more efficient?
Will the first one, with a PROC SORT after it, nevertheless be more efficient than what I proposed?
I will try both, but I do not know at all whether the hash table solution will be considered understandable enough to be validated.
Actually, the first version yields about 2% more observations than the naive procedure I stated in my initial message.
I do not know at all why that is, but they seem to be valid.
In my test data I did not check for duplicates. When the data is sorted, they will be dropped automatically with the first solution.
The second approach handles that through the hash's unique-key requirement.
I forgot to mention data-quality steps to verify that the assumptions are correct. A duplicate order seems to me incorrect/impossible.
Reporting some descriptive numbers could help to alert you when running processes. It is the check-and-double-check approach of a good accountant.
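As a hypothetical illustration of such a data-quality check (in Python; the declarations are made up), one could report any declaration that occurs more than once:

```python
from collections import Counter

# Hypothetical data-quality check: report declarations that occur more
# than once (a duplicate order would be incorrect/impossible).
declarations = ["23456A", "23456B", "23456B", "123456A"]
dupes = {d: n for d, n in Counter(declarations).items() if n > 1}
print(dupes)  # {'23456B': 2}
```

An empty result confirms the assumption; a non-empty one flags records to investigate before deduplicating.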
The last point was indeed my fault. The naive method and the first one you proposed now yield the same table.
Sorry to still bother you, but how can last.identifier be recognized as last if the observations are not sorted by identifier?
The first coding, using last., is very clean to understand, and it performs excellently when the data is grouped.
Using the "last." condition is the required knowledge here.
Delivering it sorted can be an easy part, as it may already come in ordered (SQL, external DBMS), or it can be done with a sort. A sort is a heavy load but very well optimized.
I could have done a better job commenting the hash part, as it is:
- reading the data and preparing to access it in order;
- starting at the end and then finding the last occurrence of a group, which is logically the same as the last. condition of the sort approach;
- storing the data in an additional hash, which is logically the same as the IF statement on last.
What I did is have a logical process in my mind and then solve it with different coding techniques. The logical process design could, and should, be part of the documentation. At least in the old days we were asked to do that.
The advantage of the second method is that additional processing can be done on another dataset, using the hash for some indexing/merging. If you need some records (a small part) of that original big dataset, you can use obspnt to read them directly. These are the same fundamentals a DBMS uses with indexing and SQL; here it is more of a NoSQL approach, and NoSQL is very modern and hyped again.
Well thanks for all, I guess I have enough material to select a solution.
I will try to propose both, as this is the rate-determining step. Also, learning a bit about hash tables sounds interesting to me.
But the point is that the end users, who will have to use it annually when I am finished, have even less technical background than I have, so I already have an idea about which solution my boss will choose.
Thanks a lot anyway !
The only requirement for first./last. processing is that the data is grouped: all records of the same type/group sitting together in one block.
Sorting is a more advanced way of grouping; it adds ordering on top of it.
Building that test data, I realized it was grouped but not ordered on the identifier. The declaration, however, by the way it is generated, is ordered.
To get that back, we just need the last record in each group (sorting step bypassed).