Aboiron
Calcite | Level 5

You are right that I want to NODUPKEY by identifier; that is what I do for now.

I can indeed skip a table (it can also be the first one: just sort it and keep it), but I do not know which is best in terms of time.

scmebu
SAS Employee

Because your source data is in a DBMS, it might be most efficient to perform the processing there.  You can write your own SQL query and submit it using PROC SQL's explicit pass-through capability.  If you're not quite sure how to write that query, you can look to PROC SORT for an example.  When the NODUPKEY option is specified, PROC SORT can generate and submit SQL to a DBMS to perform the work.  Obviously, the sorting BY identifier DESCENDING declaration with NODUPKEY isn't what you want but the SQL it generates can serve as your starting point.  The flavor of SQL generated depends upon the DBMS being used.  You can print the generated SQL to the log by specifying the SQL_IP_TRACE=SOURCE option.

Here's an example:

libname dbms ... ; /* Modify as appropriate for your DBMS */

proc delete data=dbms.have;
run;

data dbms.have;
 length identifier declaration $ 10;
 input identifier declaration;
 cards;
Y 23456C
Y 23456B
Y 23456A
X 123456A
X 123456B
;
run;

options msglevel=i;
options sql_ip_trace=source;

/*************************************************** 
  Use the SQL generated here as a starting point 
 ***************************************************/
proc sort data=dbms.have out=have2 nodupkey;
 by identifier descending declaration;
run;

proc print data=have2;
run;

/*************************************************** 
  For example, we take the generated SQL and submit 
  it to the DBMS using explicit pass-through.  This
  does not quite do what we want, though.
 ***************************************************/
proc sql nowarn;
 drop table have2;
 connect using dbms as db;
 create table have2 as select * from connection to db
 (
   WITH "subquery0" ( "declaration", "identifier" ) AS ( 
     SELECT 
       "declaration" AS "declaration", 
       "identifier" AS "identifier" 
     FROM "have" 
   ) 
   SELECT 
      "table0"."identifier",
      "table0"."declaration" 
   FROM ( 
     SELECT "declaration", "identifier" 
     FROM ( 
       SELECT 
         "declaration", 
         "identifier", 
         ROW_NUMBER() OVER ( 
           PARTITION BY "identifier", "declaration"
           ORDER BY "identifier", "declaration" DESC  
         ) AS "tempcol0" 
     FROM  "subquery0" 
     ) AS "subquery1" 
     WHERE ( "tempcol0" = 1 )  
   )  AS "table0"  
   ORDER BY 
     "table0"."identifier", 
     "table0"."declaration" DESC
 );
 disconnect from db;
quit;

proc print data=have2;
run;

/*************************************************** 
  But we can modify the SQL, removing the 
  declaration column from the PARTITION BY clause,
  which should produce what we want
 ***************************************************/
proc sql nowarn;
 drop table have2;
 connect using dbms as db;
 create table have2 as select * from connection to db
 (
   WITH "subquery0" ( "declaration", "identifier" ) AS ( 
     SELECT 
       "declaration" AS "declaration", 
       "identifier" AS "identifier" 
     FROM "have" 
   ) 
   SELECT 
      "table0"."identifier",
      "table0"."declaration" 
   FROM ( 
     SELECT "declaration", "identifier" 
     FROM ( 
       SELECT 
         "declaration", 
         "identifier", 
         ROW_NUMBER() OVER ( 
           PARTITION BY "identifier" /*, "declaration" */
           ORDER BY "identifier", "declaration" DESC  
         ) AS "tempcol0" 
     FROM  "subquery0" 
     ) AS "subquery1" 
     WHERE ( "tempcol0" = 1 )  
   )  AS "table0"  
   ORDER BY 
     "table0"."identifier", 
     "table0"."declaration" DESC
 );
 disconnect from db;
quit;

proc print data=have2;
run;

The last PROC PRINT produces:

Obs    identifier    declaration

 1     X             123456B
 2     Y             23456C

jakarman
Barite | Level 11

Just going back to the environment. The %put _all_ includes the %put _automatic_; it will show the automatic system variables. This is UE on Windows.

SYSSCP and SYSSCPL will tell you the OS your SAS session is running on. As you say it is a sas7bdat file, I am convinced it is all Windows desktop based. You are not using Enterprise Guide but the old classic DMS.

All your processing will be done by your desktop/laptop. The 2 GB MEMSIZE sizing is rather small; maybe a 32-bit OS and SAS, or a VDI, is being used.

This sizing normally belongs, as a default, to a server session setting.

43         %put _automatic_ ;
AUTOMATIC AFDSID 0
AUTOMATIC AFDSNAME
AUTOMATIC AFLIB
AUTOMATIC AFSTR1
AUTOMATIC AFSTR2
AUTOMATIC FSPBDV
AUTOMATIC SYSADDRBITS 64
AUTOMATIC SYSBUFFR
AUTOMATIC SYSCC 0
AUTOMATIC SYSCHARWIDTH 1
AUTOMATIC SYSCMD
AUTOMATIC SYSDATASTEPPHASE
AUTOMATIC SYSDATE 08OCT14
AUTOMATIC SYSDATE9 08OCT2014
AUTOMATIC SYSDAY Wednesday
AUTOMATIC SYSDEVIC
AUTOMATIC SYSDMG 0
AUTOMATIC SYSDSN _NULL_
AUTOMATIC SYSENCODING utf-8
AUTOMATIC SYSENDIAN LITTLE
AUTOMATIC SYSENV BACK
AUTOMATIC SYSERR 0
AUTOMATIC SYSERRORTEXT Shell escape is not valid in this SAS session.
AUTOMATIC SYSFILRC 1
AUTOMATIC SYSHOSTINFOLONG Linux LIN X64 2.6.32-431.11.2.el6.x86_64 #1 SMP Tue Mar 25 19:59:55 UTC 2014 x86_64 CentOS release 6.5 (Final)
AUTOMATIC SYSHOSTNAME localhost
AUTOMATIC SYSINDEX 0
AUTOMATIC SYSINFO 0
AUTOMATIC SYSJOBID 24800
AUTOMATIC SYSLAST _NULL_
AUTOMATIC SYSLCKRC 0
AUTOMATIC SYSLIBRC 0
AUTOMATIC SYSLOGAPPLNAME
AUTOMATIC SYSMACRONAME
AUTOMATIC SYSMAXLONG 9007199254740992
AUTOMATIC SYSMENV S
AUTOMATIC SYSMSG
AUTOMATIC SYSNCPU 2
AUTOMATIC SYSNOBS 58
AUTOMATIC SYSODSESCAPECHAR
AUTOMATIC SYSODSGRAPHICS 1
AUTOMATIC SYSODSPATH WORK.TEMPLAT(UPDATE) SASUSER.TEMPLAT(READ) SASHELP.TMPLMST(READ)
AUTOMATIC SYSPARM
AUTOMATIC SYSPBUFF
AUTOMATIC SYSPROCESSID 41D9C14253A7EC464018000000000000
AUTOMATIC SYSPROCESSMODE SAS Workspace Server
AUTOMATIC SYSPROCESSNAME Object Server
AUTOMATIC SYSPROCNAME
AUTOMATIC SYSRC 0
AUTOMATIC SYSSCP LIN X64
AUTOMATIC SYSSCPL Linux
AUTOMATIC SYSSITE 70068118
AUTOMATIC SYSSIZEOFLONG 8
AUTOMATIC SYSSIZEOFPTR 8
AUTOMATIC SYSSIZEOFUNICODE 4
AUTOMATIC SYSSTARTID
AUTOMATIC SYSSTARTNAME
AUTOMATIC SYSTCPIPHOSTNAME localhost.localdomain
AUTOMATIC SYSTIME 10:28
AUTOMATIC SYSTIMEZONE
AUTOMATIC SYSTIMEZONEIDENT
AUTOMATIC SYSTIMEZONEOFFSET -14400
AUTOMATIC SYSUSERID sasdemo
AUTOMATIC SYSVER 9.4
AUTOMATIC SYSVLONG 9.04.01M1P120413
AUTOMATIC SYSVLONG4 9.04.01M1P12042013
AUTOMATIC SYSWARNINGTEXT
44

---->-- ja karman --<-----
jakarman
Barite | Level 11

I have taken Patrick's test set and modified it so that it has two variables. I added/changed the value to obspnt.

You can use it that way once you have done the selection, relating back to the original record, even using the POINT= method to access that one.

I added the log with FULLSTIMER, using UE. It is about 13M records with a dataset size of 500 MB (ca. 15 s).

The goal is getting back the last of the duplicates.

The first solution solves it, as commented, by exploiting the knowledge that declarations are clustered and ordered.

The BY processing with NOTSORTED will do, and runs in about 15 seconds.

When the data is not grouped, a sort step can solve that, needing some additional time.

The second solution solves it with a hash. Loading the hash will order/sort the data, so with this approach no separate sorting step is needed.

Then, reading back in an iteration, a second hash is filled with the last entry for each 14-character key. This runs in about 1 minute 20 seconds.

I planned to remove key items, but got a locking error because the iterator was holding its position.

As shown, the DATA step processing itself is not used; you could do more with those hashes in memory. About 1.5 GB is used.

I made a mistake on the test dataset: with 10 times more records (135M), durations were about 10 minutes for creating it and 5 minutes for the last-record processing. The hash version could not run.

The more that is known of your processing and requirements, the better the choices can be tailored to that.

Hundreds of programs and datasets? That sounds like something worth spending some effort on.

data test.have(keep=identifier declaration obspnt);
 attrib identifier length=$14 declaration length=$15;
 retain obspnt 0;
 do i=1 to 1000000;  /* number of identifier groups */
   _value=ceil(ranuni(1)*12000000000);
   identifier=putn(_value,'z14.0');
   _stop=ceil(ranuni(2)*26); /* 26 letters at max; uppercase ASCII letters start at byte 65 ('A') */
   do _i=1 to _stop;
    declaration=cats(identifier,byte(64+_i));
    obspnt+1;
    output;
   end;
 end;
run;

NOTE: The data set TEST.HAVE has 13482024 observations and 3 variables.
NOTE: DATA statement used (Total process time):
       real time           14.21 seconds
       user cpu time       0.00 seconds
       system cpu time     15.37 seconds
       memory              379.96k
       OS Memory           25236.00k
       Timestamp           08-10-2014 04:40:07 PM
       Step Count                        42  Switch Count  74
       Page Faults                       0
       Page Reclaims                     68
       Page Swaps                        0
       Voluntary Context Switches        272
       Involuntary Context Switches      90
       Block Input Operations            0
       Block Output Operations           1058064

/* unique identifiers unsorted, but declarations within each are clustered and ordered; we want the last */
data test.want1;
 set test.have;
 by identifier notsorted;
 if last.identifier;
run;

NOTE: There were 13482024 observations read from the data set TEST.HAVE.
NOTE: The data set TEST.WANT1 has 1000000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
       real time           15.41 seconds
       user cpu time       0.00 seconds
       system cpu time     15.96 seconds
       memory              562.09k
       OS Memory           25236.00k
       Timestamp           08-10-2014 04:40:23 PM
       Step Count                        43  Switch Count  40
       Page Faults                       0
       Page Reclaims                     39
       Page Swaps                        0
       Voluntary Context Switches        105
       Involuntary Context Switches      245
       Block Input Operations            0
       Block Output Operations           78992
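The BY identifier NOTSORTED trick relies only on equal keys being adjacent, and the same idea exists in other languages. Here is a small illustrative sketch in Python, where itertools.groupby likewise starts a new group whenever the key changes (data values follow the earlier examples):

```python
from itertools import groupby

# Grouped but not sorted: all rows for an identifier are adjacent,
# yet the identifiers themselves appear in no particular order.
have = [
    ("Y", "23456A"), ("Y", "23456B"), ("Y", "23456C"),
    ("X", "123456A"), ("X", "123456B"),
]

# groupby, like BY ... NOTSORTED, breaks groups on every key change,
# so keeping the final row of each group mimics "if last.identifier;".
want1 = [list(rows)[-1] for _, rows in groupby(have, key=lambda r: r[0])]

print(want1)  # [('Y', '23456C'), ('X', '123456B')]
```

As with the DATA step, if equal identifiers were not adjacent this would emit one row per run of keys, not one per identifier; that is why the grouping assumption matters.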

      

/* using just declaration to build a hash with the highest letter value */
data notwant (drop=_:);
  set sashelp.class end=last;

  if _n_=1 then do;
      length _declprvh _declprvt $14;
      if 0 then set test.have(keep=declaration obspnt rename=(declaration=_decl));
      dcl hash h1 (dataset:'test.have(keep=declaration obspnt rename=(declaration=_decl) obs=max)', duplicate:'r',
                   ordered:'yes', hashexp:20);
      dcl hiter h1p('h1');
      _rc=h1.defineKey('_decl');
      _rc=h1.defineData('_decl','obspnt');
      _rc=h1.defineDone();
      dcl hash h2 (ordered:'yes', duplicate:'r', hashexp:20);
      dcl hiter h2p('h2');
      _rc=h2.defineKey('_decl');
      _rc=h2.defineData('_decl','obspnt');
      _rc=h2.defineDone();
      call missing(_decl,obspnt);

      _rcp=h1p.last();
      do while (not _rcp);
        _declprvt=substr(_decl,1,14);
        if _declprvt ne _declprvh then _rc=h2.add();
        _declprvh=_declprvt;
        _rcp=h1p.prev();
      end;
      /* Create output data set from hash object */
      _rc = h2.output(dataset:'test.want2(rename=(_decl=declaration))');
  end;

  /* your program code on the class dataset */
run;

NOTE: There were 13482024 observations read from the data set TEST.HAVE.
WARNING: Hash Object DATASET option should be used when specifying the DUPLICATE option.
NOTE: The data set TEST.WANT2 has 1000000 observations and 2 variables.
NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.NOTWANT has 19 observations and 6 variables.
NOTE: DATA statement used (Total process time):
       real time           1:21.50
       user cpu time       30.95 seconds
       system cpu time     47.99 seconds
       memory              1523263.51k
       OS Memory           1547520.00k
       Timestamp           08-10-2014 04:41:44 PM
       Step Count                        44  Switch Count  154
       Page Faults                       0
       Page Reclaims                     98625
       Page Swaps                        0
       Voluntary Context Switches        699
       Involuntary Context Switches      1111
       Block Input Operations            0
       Block Output Operations           47656

---->-- ja karman --<-----
Aboiron
Calcite | Level 5

I now realize I should have said it before, but I have to order the table because I then merge it with another table by this identifier.

So as I understand your message (I do my best, but it is a bit technical for me), the second method does that and the first does not, so if I add a PROC SORT to the first method, the second one will be more efficient?

Will the first one, with a PROC SORT afterwards, nevertheless be more efficient than what I proposed?

I will try both, but I do not know at all whether the hash table solution will be considered understandable enough to be validated.

Aboiron
Calcite | Level 5

Actually, the first version yields about 2% more observations than the naive procedure I stated in my initial message.

I do not know at all why that is, but they seem to be valid.

jakarman
Barite | Level 11

In my test data I did not check for duplicates. When the data is sorted, they will be dropped automatically by the first solution.

The second approach handles that through its unique-key requirement.

I forgot to mention data-quality steps to verify that the assumptions are correct. A duplicate order seems to me incorrect/impossible.
Reporting some descriptive numbers could help to alert you when running the processes. It is the check-and-double-check approach of a good accountant.

---->-- ja karman --<-----
Aboiron
Calcite | Level 5

That last point was indeed my fault. The naive method and the first one you proposed now yield the same table.

Sorry to still bother you, but how can last.identifier be recognized as last if observations are not sorted by identifier?

jakarman
Barite | Level 11

The first coding, using last., is very clean to understand, and it performs excellently when the data is grouped.
Understanding the "last." condition is the required knowledge here.

Having it delivered sorted can be the easy part, when it comes in already ordered (SQL from an external DBMS) or by using a sort. Sorting is a heavy load but very well optimized.

I could have done a better job commenting the hash part; what it does is:

- reading the data and preparing it for ordered access,

- starting at the end and then finding the last occurrence of each group,

   which is logically the same as the last. condition of the sort approach,

- storing that data in an additional hash,

   which is logically the same as the IF statement on the last. condition.

What I did was have a logical process in my mind and then solve it with different coding techniques. The logical process design could, and should, be part of the documentation. At least in the old days we were asked to do that.

The advantage of the second method is that additional processing can be done on another dataset, using the hash for some indexing/merging. If you need some records (a small part) of that original big dataset, you can use obspnt to read them directly. These are the same fundamentals a DBMS uses with indexing and SQL. Now it is more of a NoSQL approach; NoSQL is very modern and hyped again.

---->-- ja karman --<-----
Aboiron
Calcite | Level 5

Well thanks for all, I guess I have enough material to select a solution.

I will try to propose both, as this is the rate-determining step. Also, learning a bit about hash tables sounds interesting to me.

But the point is that the end users, who will have to run this annually once I have finished, have even less technical background than I do, so I already have an idea about the solution my boss will choose.

Thanks a lot anyway !

jakarman
Barite | Level 11

The only requirement for first/last processing is that the data is grouped: all records of the same type/group must be in one contiguous block.

Sorting is a more advanced way of grouping; it adds ordering to it.

Building that test data I realized it was grouped but not ordered on the identifier. The declaration, however, by the way it is generated, is ordered.

To get the result back, we just need the last record in each group (sorting step bypassed).

---->-- ja karman --<-----


Discussion stats
  • 25 replies
  • 6127 views
  • 6 likes
  • 8 in conversation