You are right that I want to NODUPKEY by identifier - that is what I do for now.
I can indeed skip a table (it can also be the first one: just sort it and keep it), but I do not know which approach is best in terms of time.
Because your source data is in a DBMS, it might be most efficient to perform the processing there. You can write your own SQL query and submit it using PROC SQL's explicit pass-through capability. If you're not quite sure how to write that query, you can look to PROC SORT for an example. When the NODUPKEY option is specified, PROC SORT can generate and submit SQL to a DBMS to perform the work. Obviously, the sorting BY identifier DESCENDING declaration with NODUPKEY isn't what you want but the SQL it generates can serve as your starting point. The flavor of SQL generated depends upon the DBMS being used. You can print the generated SQL to the log by specifying the SQL_IP_TRACE=SOURCE option.
Here's an example:
libname dbms ... ;  /* Modify as appropriate for your DBMS */

proc delete data=dbms.have; run;

data dbms.have;
   length identifier declaration $ 10;
   input identifier declaration;
   cards;
Y 23456C
Y 23456B
Y 23456A
X 123456A
X 123456B
;
run;

options msglevel=i;
options sql_ip_trace=source;

/***************************************************
 Use the SQL generated here as a starting point
***************************************************/
proc sort data=dbms.have out=have2 nodupkey;
   by identifier descending declaration;
run;

proc print data=have2; run;

/***************************************************
 For example, we take the generated SQL and submit
 it to the DBMS using explicit pass-through. This
 does not quite do what we want, though.
***************************************************/
proc sql nowarn;
   drop table have2;
   connect using dbms as db;
   create table have2 as select * from connection to db (
      WITH "subquery0" ( "declaration", "identifier" ) AS (
         SELECT "declaration" AS "declaration", "identifier" AS "identifier"
         FROM "have"
      )
      SELECT "table0"."identifier", "table0"."declaration"
      FROM (
         SELECT "declaration", "identifier"
         FROM (
            SELECT "declaration", "identifier",
                   ROW_NUMBER() OVER (
                      PARTITION BY "identifier", "declaration"
                      ORDER BY "identifier", "declaration" DESC
                   ) AS "tempcol0"
            FROM "subquery0"
         ) AS "subquery1"
         WHERE ( "tempcol0" = 1 )
      ) AS "table0"
      ORDER BY "table0"."identifier", "table0"."declaration" DESC
   );
   disconnect from db;
quit;

proc print data=have2; run;

/***************************************************
 But we can modify the SQL, removing the declaration
 column from the PARTITION BY clause, which should
 produce what we want
***************************************************/
proc sql nowarn;
   drop table have2;
   connect using dbms as db;
   create table have2 as select * from connection to db (
      WITH "subquery0" ( "declaration", "identifier" ) AS (
         SELECT "declaration" AS "declaration", "identifier" AS "identifier"
         FROM "have"
      )
      SELECT "table0"."identifier", "table0"."declaration"
      FROM (
         SELECT "declaration", "identifier"
         FROM (
            SELECT "declaration", "identifier",
                   ROW_NUMBER() OVER (
                      PARTITION BY "identifier" /*, "declaration" */
                      ORDER BY "identifier", "declaration" DESC
                   ) AS "tempcol0"
            FROM "subquery0"
         ) AS "subquery1"
         WHERE ( "tempcol0" = 1 )
      ) AS "table0"
      ORDER BY "table0"."identifier", "table0"."declaration" DESC
   );
   disconnect from db;
quit;

proc print data=have2; run;
The last PROC PRINT produces:
Obs    identifier    declaration
  1    X             123456B
  2    Y             23456C
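For readers who want to experiment with this keep-the-last-row logic without a DBMS connection, here is a hypothetical illustration using SQLite's ROW_NUMBER() from Python. This is not the SQL your DBMS would generate; the table and column names are simply taken from the example above.

```python
import sqlite3

# Hypothetical illustration: the "keep the highest declaration per
# identifier" logic from the pass-through SQL, shown with SQLite.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE have (identifier TEXT, declaration TEXT);
INSERT INTO have VALUES
  ('Y','23456C'), ('Y','23456B'), ('Y','23456A'),
  ('X','123456A'), ('X','123456B');
""")

rows = con.execute("""
SELECT identifier, declaration FROM (
  SELECT identifier, declaration,
         ROW_NUMBER() OVER (
           PARTITION BY identifier        -- identifier only, not declaration
           ORDER BY declaration DESC
         ) AS rn
  FROM have
) WHERE rn = 1
ORDER BY identifier
""").fetchall()
print(rows)  # [('X', '123456B'), ('Y', '23456C')]
```

This produces the same two rows as the PROC PRINT above.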
Just going back to the environment: %put _all_ includes %put _automatic_; it will show the automatic system variables. This is UE on Windows.
SYSSCP and SYSSCPL will tell you the OS your SAS session is running on. As you say it is a sas7bdat file, I am convinced it is all Windows desktop based. You are not using Enterprise Guide but the old classic DMS.
All your processing will be done by your desktop/laptop. The 2GB MEMSIZE sizing is rather small; maybe a 32-bit OS and SAS, or a VDI, is being used.
This sizing normally belongs to a default server session setting.
43 %put _automatic_ ;
AUTOMATIC AFDSID 0
AUTOMATIC AFDSNAME
AUTOMATIC AFLIB
AUTOMATIC AFSTR1
AUTOMATIC AFSTR2
AUTOMATIC FSPBDV
AUTOMATIC SYSADDRBITS 64
AUTOMATIC SYSBUFFR
AUTOMATIC SYSCC 0
AUTOMATIC SYSCHARWIDTH 1
AUTOMATIC SYSCMD
AUTOMATIC SYSDATASTEPPHASE
AUTOMATIC SYSDATE 08OCT14
AUTOMATIC SYSDATE9 08OCT2014
AUTOMATIC SYSDAY Wednesday
AUTOMATIC SYSDEVIC
AUTOMATIC SYSDMG 0
AUTOMATIC SYSDSN _NULL_
AUTOMATIC SYSENCODING utf-8
AUTOMATIC SYSENDIAN LITTLE
AUTOMATIC SYSENV BACK
AUTOMATIC SYSERR 0
AUTOMATIC SYSERRORTEXT Shell escape is not valid in this SAS session.
AUTOMATIC SYSFILRC 1
AUTOMATIC SYSHOSTINFOLONG Linux LIN X64 2.6.32-431.11.2.el6.x86_64 #1 SMP Tue Mar 25 19:59:55 UTC 2014 x86_64 CentOS release 6.5 (Final)
AUTOMATIC SYSHOSTNAME localhost
AUTOMATIC SYSINDEX 0
AUTOMATIC SYSINFO 0
AUTOMATIC SYSJOBID 24800
AUTOMATIC SYSLAST _NULL_
AUTOMATIC SYSLCKRC 0
AUTOMATIC SYSLIBRC 0
AUTOMATIC SYSLOGAPPLNAME
AUTOMATIC SYSMACRONAME
AUTOMATIC SYSMAXLONG 9007199254740992
AUTOMATIC SYSMENV S
AUTOMATIC SYSMSG
AUTOMATIC SYSNCPU 2
AUTOMATIC SYSNOBS 58
AUTOMATIC SYSODSESCAPECHAR
AUTOMATIC SYSODSGRAPHICS 1
AUTOMATIC SYSODSPATH WORK.TEMPLAT(UPDATE) SASUSER.TEMPLAT(READ) SASHELP.TMPLMST(READ)
AUTOMATIC SYSPARM
AUTOMATIC SYSPBUFF
AUTOMATIC SYSPROCESSID 41D9C14253A7EC464018000000000000
AUTOMATIC SYSPROCESSMODE SAS Workspace Server
AUTOMATIC SYSPROCESSNAME Object Server
AUTOMATIC SYSPROCNAME
AUTOMATIC SYSRC 0
AUTOMATIC SYSSCP LIN X64
AUTOMATIC SYSSCPL Linux
AUTOMATIC SYSSITE 70068118
AUTOMATIC SYSSIZEOFLONG 8
AUTOMATIC SYSSIZEOFPTR 8
AUTOMATIC SYSSIZEOFUNICODE 4
AUTOMATIC SYSSTARTID
AUTOMATIC SYSSTARTNAME
AUTOMATIC SYSTCPIPHOSTNAME localhost.localdomain
AUTOMATIC SYSTIME 10:28
AUTOMATIC SYSTIMEZONE
AUTOMATIC SYSTIMEZONEIDENT
AUTOMATIC SYSTIMEZONEOFFSET -14400
AUTOMATIC SYSUSERID sasdemo
AUTOMATIC SYSVER 9.4
AUTOMATIC SYSVLONG 9.04.01M1P120413
AUTOMATIC SYSVLONG4 9.04.01M1P12042013
AUTOMATIC SYSWARNINGTEXT
44
I have taken Patrick's test set and modified it so that it has two variables; I added/changed the value obspnt.
You can use it that way once you have done the selection relating to the original record, even using the point= method for accessing that one.
I added the log/FULLSTIMER output using UE. It is about 13M records with a dataset size of 500MB (ca. 15s).
The goal is getting the last of the duplicates back.
The first solution solves it, as commented, by using the knowledge that declarations are clustered and ordered.
BY processing with NOTSORTED will do, and runs in about 15 seconds.
When it is not sorted, a sort step can solve that, at the cost of some additional time.
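A hypothetical sketch (in Python, just to illustrate the logic, not SAS code) of what BY identifier NOTSORTED with `if last.identifier;` does on clustered data: a row is kept only when the next row starts a new group, or when there is no next row.

```python
# Data is grouped (clustered) by identifier but not sorted by it.
have = [("Y", "23456A"), ("Y", "23456B"), ("Y", "23456C"),
        ("X", "123456A"), ("X", "123456B")]

# Keep a row only when the following row belongs to a different group
# (or there is no following row) -- the "last.identifier" condition.
want = [row for row, nxt in zip(have, have[1:] + [None])
        if nxt is None or nxt[0] != row[0]]
print(want)  # [('Y', '23456C'), ('X', '123456B')]
```

Note that the output keeps the groups in their original (unsorted) order, which is why a sort is still needed if the result must be ordered.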
The second solution solves it with a hash. Loading the hash will order/sort the data, so with this approach no separate sorting step is needed.
Then, iterating backwards, a second hash is filled with the last occurrence of each 14-character key. This runs in about 1m20s.
I planned to remove key items, but got a locking error because the iterator position was locking the item.
As shown, the DATA step's own processing is not used; you could do some of that using those hashes in memory. There is 1.5GB used.
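A hypothetical Python sketch of the hash approach's logic (the identifiers and record numbers are made up): order by declaration, walk backwards from the end, and keep the first row seen for each 14-character identifier prefix, which is the last declaration of that group.

```python
# declaration = 14-char identifier + one letter; values are record numbers
# (like obspnt). Made-up data for illustration only.
have = {"00000000000001A": 1, "00000000000001B": 2,
        "00000000000002A": 3, "00000000000002B": 4, "00000000000002C": 5}

want, prev = {}, None
for decl in sorted(have, reverse=True):   # like h1p.last() .. h1p.prev()
    ident = decl[:14]                     # the identifier prefix
    if ident != prev:                     # first hit walking backwards
        want[decl] = have[decl]           # = last declaration of the group
    prev = ident
print(want)  # {'00000000000002C': 5, '00000000000001B': 2}
```

The "first hit walking backwards" test plays the same role as the last. condition in the BY-group solution.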
I made a mistake on the test dataset: with 10 times more records (135M), durations were 10 minutes for creating the data and 5 minutes for the last. processing. The hash version could not run.
The more that is known of your processing and requirements, the better the choices can be tailored to that.
Hundreds of programs/datasets? That sounds like spending some effort on this is worth it.
44 data test.have(keep=identifier declaration obspnt);
45 attrib identifier length=$14 declaration length=$15.;
46 retain obspnt 0;
47 do i=1 to 1000000; /* no of records */
48 _value=ceil(ranuni(1)*12000000000);
49 identifier=putn(_value,'z14.0');
50 _stop=ceil(ranuni(2)*26 ); /* 26 letters at max, ascii letters latin-1 start at 40x */
51 do _i=1 to _stop;
52 declaration=cats(identifier,byte(64+_i));
53 obspnt+1;
54 output;
55 end;
56 end;
57 run;
NOTE: The data set TEST.HAVE has 13482024 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 14.21 seconds
user cpu time 0.00 seconds
system cpu time 15.37 seconds
memory 379.96k
OS Memory 25236.00k
Timestamp 08-10-2014 04:40:07 PM
Step Count 42 Switch Count 74
Page Faults 0
Page Reclaims 68
Page Swaps 0
Voluntary Context Switches 272
Involuntary Context Switches 90
Block Input Operations 0
Block Output Operations 1058064
58
59 /* unique identifier unsorted, but declaration within that is clustered and ordered; want the last */
60 data test.want1 ;
61 set test.have ;
62 by identifier notsorted ;
63 if last.identifier;
64 run;
NOTE: There were 13482024 observations read from the data set TEST.HAVE.
NOTE: The data set TEST.WANT1 has 1000000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 15.41 seconds
user cpu time 0.00 seconds
system cpu time 15.96 seconds
memory 562.09k
OS Memory 25236.00k
Timestamp 08-10-2014 04:40:23 PM
Step Count 43 Switch Count 40
Page Faults 0
Page Reclaims 39
Page Swaps 0
Voluntary Context Switches 105
Involuntary Context Switches 245
Block Input Operations 0
Block Output Operations 78992
65
66 /* using just declaration to build a hash with the highest letter value */
67 data notwant (drop=_:);
68 set sashelp.class end=last;
69
70 if _n_=1 then do;
71 length _declprvh _declprvt $14 ;
72 if 0 then set test.have(keep=declaration obspnt rename=(declaration=_decl));
73 dcl hash h1 (dataset:'test.have(keep=declaration obspnt rename=(declaration=_decl) obs=max)', duplicate:'r', ordered:'yes', hashexp:20);
74 dcl hiter h1p('h1');
75 _rc=h1.defineKey('_decl');
76 _rc=h1.defineData('_decl','obspnt');
77 _rc=h1.defineDone();
78 dcl hash h2 ( ordered: 'yes', duplicate:'r', hashexp:20);
79 dcl hiter h2p('h2');
80 _rc=h2.defineKey('_decl');
81 _rc=h2.defineData('_decl','obspnt');
82 _rc=h2.defineDone();
83 call missing(_decl,obspnt);
84
85 _rcp=h1p.last();
86 do while ( not _rcp) ;
87 _declprvt=substr(_decl,1,14);
88 If _declprvt not = _declprvh then _rc=h2.add( );
89 _declprvh=_declprvt;
90 _rcp=h1p.prev();
91 end;
92 /* Create output data set from hash object */
93 _rc = h2.output(dataset:'test.want2(rename=(_decl=declaration))');
94 end;
95
96 /* your program code on class dataset */
97 run;
NOTE: There were 13482024 observations read from the data set TEST.HAVE.
WARNING: Hash Object DATASET option should be used when specifying the DUPLICATE option.
NOTE: The data set TEST.WANT2 has 1000000 observations and 2 variables.
NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.NOTWANT has 19 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 1:21.50
user cpu time 30.95 seconds
system cpu time 47.99 seconds
memory 1523263.51k
OS Memory 1547520.00k
Timestamp 08-10-2014 04:41:44 PM
Step Count 44 Switch Count 154
Page Faults 0
Page Reclaims 98625
Page Swaps 0
Voluntary Context Switches 699
Involuntary Context Switches 1111
Block Input Operations 0
Block Output Operations 47656
I now realize I should have said it before, but I have to order the table because I then merge it with another table by this identifier.
So as I understand your message (I do my best, but it is a bit technical for me), the second method does that and the first does not, so if I add a PROC SORT to the first method, will the second one be more efficient?
Will the first one, with a PROC SORT after it, nevertheless be more efficient than what I proposed?
I will try both, but I do not know at all whether the hash table solution will be considered understandable enough to be validated.
Actually, the first version yields about 2% more observations than the naive procedure I stated in my initial message.
I do not know at all why that is, but they seem to be valid.
In my test data I did not check for duplicates. When the data is sorted, they will be dropped automatically with the first solution.
The second approach handles that through the hash's unique-key requirement.
I forgot to mention data-quality steps to verify that the assumptions are correct. A duplicate order seems to me incorrect/impossible.
Reporting some descriptive numbers could help to alert you when running processes. It is the check-and-double-check approach of a good accountant.
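As a hypothetical illustration of such a data-quality check (in Python; the declarations are made up), one could report any declaration that occurs more than once:

```python
from collections import Counter

# Hypothetical data-quality check: report declarations that occur more
# than once (a duplicate order would be incorrect/impossible).
declarations = ["23456A", "23456B", "23456B", "123456A"]
dupes = {d: n for d, n in Counter(declarations).items() if n > 1}
print(dupes)  # {'23456B': 2}
```

An empty result confirms the assumption; a non-empty one flags records to investigate before deduplicating.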
The last point was indeed my fault. The naive method and the first one you proposed now yield the same table.
Sorry to still bother you, but how can last.identifier be recognized as last if the observations are not sorted by identifier?
The first coding, using last., is very clean to understand, and it performs excellently when the data is grouped.
Using the "last." condition is the required knowledge here.
Delivering it sorted can be an easy part, as it may already come in ordered (SQL, external DBMS), or it can be done with a sort. A sort is a heavy load but very well optimized.
I could have done a better job commenting the hash part, as it is:
- reading the data and preparing to access it in order;
- starting at the end and then finding the last occurrence of a group, which is logically the same as the last. condition of the sort approach;
- storing the data in an additional hash, which is logically the same as the IF statement on last.
What I did is have a logical process in my mind and then solve it with different coding techniques. The logical process design could, and should, be part of the documentation. At least in the old days we were asked to do that.
The advantage of the second method is that additional processing can be done on another dataset, using the hash for some indexing/merging. If you need some records (a small part) of that original big dataset, you can use obspnt to read them directly. These are the same fundamentals a DBMS uses with indexing and SQL; here it is more of a NoSQL approach, and NoSQL is very modern and hyped again.
Well thanks for all, I guess I have enough material to select a solution.
I will try to propose both, as this is the rate-determining step. Also, learning a bit about hash tables sounds interesting to me.
But the point is that the end users, who will have to use it annually when I am finished, have even less technical background than I have, so I already have an idea about which solution my boss will choose.
Thanks a lot anyway !
The only requirement for first./last. processing is that the data is grouped: all records of the same type/group sitting together in one block.
Sorting is a more advanced way of grouping; it adds ordering on top of it.
Building that test data, I realized it was grouped but not ordered on the identifier. The declaration, however, by the way it is generated, is ordered.
To get that back, we just need the last record in each group (sorting step bypassed).