I have a database with an identifier and declarations. Declarations are constructed as the identifier plus a letter: if the identifier is 123456, the declarations would be "123456A", "123456B" and so on.
I would like to select one observation for each identifier: the declaration with the last (highest) letter, which of course is not always the same letter.
I assume I can do that with a proc sort followed by another one with nodupkey:
proc sort data=have out=have2;
  by identifier declaration /descending;
run;
proc sort data=have2 out=want nodupkey;
  by identifier;
run;
but as I have a relatively large database (tens of millions of observations), I would like to know whether there is a method that is both better suited and faster, ideally one that works in a single step, because right now it takes a lot of time.
Thanks
Edit1: fixed a typo in the program
Accepted Solutions
I have taken Patrick's test set and modified it so that it has two variables; I added/changed the value variable to obspnt.
You can use that variable, once the selection is done, to relate back to the original record, even using the point= method to access it.
I added the log with fullstimer, running under UE. It is about 13M records with a dataset size of 500MB (created in about 15 seconds).
The goal is getting back the last of the duplicates.
The first solution solves it, as commented in the code, by relying on the declarations being clustered and ordered within each identifier.
BY processing with NOTSORTED will do, and it runs in about 15 seconds.
When the data is not ordered that way, a sort step can solve that, at the cost of some additional time; see the small sketch below.
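For completeness, that extra sort would be a minimal step like this (a sketch, using the test.have dataset generated below):
proc sort data=test.have;
  by identifier declaration;
run;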
The second solution solves it with a hash. Loading the hash will order/sort the data, so with this approach no separate sorting step is needed.
Then, reading back in an iteration, a second hash is filled with the last entry for each 14-character identifier. This runs in about 1 minute 20 seconds.
I planned to remove keys from the hash objects, but got a locking error because the iterator was locking the current position.
As shown, the data step's own processing is not used; you could add some processing of your own using those hashes in memory. About 1.5 GB of memory is used.
I made a mistake on the test dataset at one point: 10 times more records (135M). Durations were about 10 minutes for creating it and 5 minutes for the last.-processing, and the hash approach could not run at all.
The more that is known of your processing and requirements, the better the choices can be tailored to that.
Hundreds of programs/datasets? That sounds like something worth spending some effort on.
44 data test.have(keep=identifier declaration obspnt);
45 attrib identifier length=$14 declaration length=$15.;
46 retain obspnt 0;
47 do i=1 to 1000000; /* no of records */
48 _value=ceil(ranuni(1)*12000000000);
49 identifier=putn(_value,'z14.0');
50 _stop=ceil(ranuni(2)*26 ); /* 26 letters at max, ascii letters latin-1 start at 40x */
51 do _i=1 to _stop;
52 declaration=cats(identifier,byte(64+_i));
53 obspnt+1;
54 output;
55 end;
56 end;
57 run;
NOTE: The data set TEST.HAVE has 13482024 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 14.21 seconds
user cpu time 0.00 seconds
system cpu time 15.37 seconds
memory 379.96k
OS Memory 25236.00k
Timestamp 08-10-2014 04:40:07 PM
Step Count 42 Switch Count 74
Page Faults 0
Page Reclaims 68
Page Swaps 0
Voluntary Context Switches 272
Involuntary Context Switches 90
Block Input Operations 0
Block Output Operations 1058064
58
59 /* unique identity unsorted, but declaration within that is clustered and ordered, wanted as last */
60 data test.want1 ;
61 set test.have ;
62 by identifier notsorted ;
63 if last.identifier;
64 run;
NOTE: There were 13482024 observations read from the data set TEST.HAVE.
NOTE: The data set TEST.WANT1 has 1000000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
real time 15.41 seconds
user cpu time 0.00 seconds
system cpu time 15.96 seconds
memory 562.09k
OS Memory 25236.00k
Timestamp 08-10-2014 04:40:23 PM
Step Count 43 Switch Count 40
Page Faults 0
Page Reclaims 39
Page Swaps 0
Voluntary Context Switches 105
Involuntary Context Switches 245
Block Input Operations 0
Block Output Operations 78992
65
66 /* using just declaration to build a hash with the highest letter value */
67 data notwant (drop=_:);
68 set sashelp.class end=last;
69
70 if _n_=1 then do;
71 length _declprvh _declprvt $14 ;
72 if 0 then set test.have(keep=declaration obspnt rename=(declaration=_decl));
73 dcl hash h1 (dataset:'test.have(keep=declaration obspnt rename=(declaration=_decl) obs=max)', duplicate:'r',
73 ! ordered: 'yes', hashexp:20);
74 dcl hiter h1p('h1');
75 _rc=h1.defineKey('_decl');
76 _rc=h1.defineData('_decl','obspnt');
77 _rc=h1.defineDone();
78 dcl hash h2 ( ordered: 'yes', duplicate:'r', hashexp:20);
79 dcl hiter h2p('h2');
80 _rc=h2.defineKey('_decl');
81 _rc=h2.defineData('_decl','obspnt');
82 _rc=h2.defineDone();
83 call missing(_decl,obspnt);
84
85 _rcp=h1p.last();
86 do while ( not _rcp) ;
87 _declprvt=substr(_decl,1,14);
88 If _declprvt not = _declprvh then _rc=h2.add( );
89 _declprvh=_declprvt;
90 _rcp=h1p.prev();
91 end;
92 /* Create output data set from hash object */
93 _rc = h2.output(dataset:'test.want2(rename=(_decl=declaration))');
94 end;
95
96 /* your program code on class dataset */
97 run;
NOTE: There were 13482024 observations read from the data set TEST.HAVE.
WARNING: Hash Object DATASET option should be used when specifying the DUPLICATE option.
NOTE: The data set TEST.WANT2 has 1000000 observations and 2 variables.
NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.NOTWANT has 19 observations and 6 variables.
NOTE: DATA statement used (Total process time):
real time 1:21.50
user cpu time 30.95 seconds
system cpu time 47.99 seconds
memory 1523263.51k
OS Memory 1547520.00k
Timestamp 08-10-2014 04:41:44 PM
Step Count 44 Switch Count 154
Page Faults 0
Page Reclaims 98625
Page Swaps 0
Voluntary Context Switches 699
Involuntary Context Switches 1111
Block Input Operations 0
Block Output Operations 47656
Nice question. And a lot more is needed to choose a direction:
- What SAS version do you have?
- What system (Windows/Unix/mainframe), and with what kind of resources (cores, memory, I/O)?
- Where is that database? Is it an external one (like Oracle) or SAS-dedicated?
- How fast should it run? (Seconds? Several minutes?)
Are both the identification and the declaration character based? Could the character be separated out? Is the identification an integer number of at most 10 digits?
Do you just want that one resulting dataset, or is there more?
So in order:
- I have SAS 9.2.
- I am on Windows 7, and I am working on a company cloud which I do not have much info about.
- It is a SAS-dedicated database.
- There are no hard constraints; faster is better, but it is part of a bigger program that is planned to run annually. The whole process can take a week or two.
- identifier is a fourteen-character variable, declaration a fifteen-character one.
- I can get the letter with a substr function (see the small sketch below), but I guess you knew that, so I am not sure what you mean about the character that can be separated.
- It is just a part of a big program (not sure that was the question though).
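For illustration only, a tiny sketch of pulling out the letter with substr, assuming it is always the 15th character (the output dataset and letter variable names are made up):
data letters;
  set have;
  letter = substr(declaration, 15, 1); /* trailing letter of the declaration */
run;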
Thx in advance
Hi,
I don't actually have SAS right now, but my suggestion would be this. First take a distinct list of identifiers e.g.:
proc sql;
  create table IDS as
  select distinct
    IDENTIFIER,
    max(rank(DECLARATION)) as XX /* just guessing here, i.e. get the max ASCII number of the letter */
  from YOUR_DATA
  group by IDENTIFIER;
quit;
Then use that to create new table:
proc sql;
create table WANT_DATA as
select *
from YOUR_DATA
where IDENTIFIER||DECLARATION in (select distinct ... from IDS);
quit;
I don't know how quick it would be though. Maybe do it in pieces, e.g. one id at a time.
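A single-query sketch of the same idea (assuming the trailing letter is the only part that differs and compares correctly as plain text; SAS proc sql remerges the summary statistic):
proc sql;
  create table WANT_DATA as
  select *
  from YOUR_DATA
  group by IDENTIFIER
  having DECLARATION = max(DECLARATION);
quit;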
Going by those tens of millions (20M? 60M?) of observations and those two variables, we are talking about a record length of about 30 bytes (latin1); at 60M records that is about 1.8 GB.
That is big, but with a modern PC with 8 GB of memory, 8 cores and 500 GB of disk (a quality laptop), it should all fit easily for in-memory processing.
When you are using SAS on Windows it will allow you to do that. Please check memsize (and the OS) with: "proc options; run;" (or "put _all_;").
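For instance, to display just that one setting (a small illustrative aside):
proc options option=memsize;
run;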
Sequential processing, reading all the data once, is commonly faster than random I/O unless you are using SSDs.
A fast solution could be building up the key as a hash. It will deliver a sorted table in memory. Working back with last()/prev() should easily give the highest value (case sensitive) for every declaration.
See SAS(R) 9.2 Language Reference: Dictionary, Fourth Edition. Only that fifteen-character key is needed in the hash, when the quality assurance is that the first 14 characters are equal to the identifier.
In that case it would be one load and one read-back for the selection. Options:
a/ change/update the data in a data step and do an output creating a new dataset.
b/ modify the hash (deleting the lower values) and do a hash table save.
Both can be combined with some other processing on the data, eliminating other steps.
Just an annual run? That one is also most likely to change annually; maintainability may be more important than speed of processing.
To be more precise, I have to do that a hundred times on databases that have between 500,000 and two million observations.
Memsize is 2147483648; not sure where to find the OS.
It will be just an annual run, so of course maintainability is key, but for now I have to run it a lot in order to test some features. Moreover, it is only a small part of my program, which is itself only a small part of the whole program that has to be run each year, so any time gained is still precious.
You still haven't told us whether your data is stored as SAS files or resides in a database (and which one). This information is crucial for coming up with a "best" solution.
Below sample code is for data stored in SAS files (Jaap's option b).
If your data is stored in a database then, depending on the database, using SQL analytic functions could be a very efficient way: see ORACLE-BASE - FIRST_VALUE and LAST_VALUE Analytic Functions.
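For that database case, a rough explicit pass-through sketch (connection details and table names are placeholders, and row_number() is used here instead of LAST_VALUE for brevity):
proc sql;
  connect to oracle (user=&user password=&pwd path=&path); /* placeholder credentials */
  create table want as
  select * from connection to oracle (
    select identifier, declaration
    from ( select identifier, declaration,
                  row_number() over (partition by identifier
                                     order by declaration desc) as rn
           from your_table )
    where rn = 1
  );
  disconnect from oracle;
quit;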
options fullstimer;
data have(keep=identifier declaration value);
attrib identifier length=8 format=best32. declaration length=$9.;
do identifier=1 to 10000000;
_stop=ceil(ranuni(1)*10);
do _i=1 to _stop;
declaration=cats(identifier,byte(_i+64));
value+1;
output;
end;
end;
run;
data want (drop=_:);
set have end=last;
if _n_=1 then
do;
if 0 then
set have(keep=declaration rename=(declaration=_decl));
dcl hash h1 (dataset:'have(keep=identifier declaration rename=(declaration=_decl))', hashexp:9);
_rc=h1.defineKey('identifier');
_rc=h1.defineData('_decl');
_rc=h1.defineDone();
end;
call missing(_decl);
_rc=h1.find();
if declaration>_decl then
h1.replace(key:identifier,data:declaration);
if last then
do _i=1 to _nobs;
set have nobs=_nobs point=_i;
call missing(_decl);
_rc=h1.find();
if declaration=_decl then output;
end;
run;
I have, at least I think so: "It is a SAS-dedicated database".
That is, I have a library pointing to a directory in which my datasets are stored as .sas7bdat files.
O.K. - then a hash approach will be hard to beat. I've loaded 10M records into the hash, which consumed around 650MB of RAM. So even if your key/data combination consumes more storage space, your volumes are quite a bit lower, and your memory setting indicates that you have up to 2GB at your disposal.
"To be more precise, I have to do that a hundred times on databases that have between 500,000 and two million observations."
Then you definitely want to go for code that is as optimized as possible, and then "wrap" it into a SAS macro so that you have to maintain the "complexity" in a single place only.
On a side note - and more for Jaap actually - the code variant with a second data step and a hash with only a key defined (with 2 variables) consumes 800MB of RAM. I'm not sure why, and this is maybe the reason why Paul Dorfman mentioned in another discussion that we need to define a single-byte variable as "data" for the hash in such cases.
And here is Jaap's option a), I believe. It runs a bit longer and uses a bit more resources, but is maybe a bit easier to understand/maintain.
options fullstimer;
data have(keep=identifier declaration value);
attrib identifier length=8 format=best32. declaration length=$9.;
do identifier=1 to 10000000;
_stop=ceil(ranuni(1)*10);
do _i=1 to _stop;
declaration=cats(identifier,byte(_i+64));
value+1;
output;
end;
end;
run;
data _null_;
set have end=last;
if _n_=1 then
do;
if 0 then
set have(keep=declaration rename=(declaration=_decl));
dcl hash h1 (dataset:'have(keep=identifier declaration rename=(declaration=_decl))');
_rc=h1.defineKey('identifier');
_rc=h1.defineData('identifier','_decl');
_rc=h1.defineDone();
end;
call missing(_decl);
_rc=h1.find();
if declaration>_decl then
h1.replace(key:identifier,data:identifier,data:declaration);
if last then h1.output(dataset:'selection(rename=(_decl=declaration))');
run;
data want;
set have;
if _n_=1 then
do;
dcl hash h1 (dataset:'selection');
_rc=h1.defineKey('identifier','declaration');
_rc=h1.defineDone();
end;
if h1.check()=0 then output;
run;
Thx, I will look at it tomorrow when more focused.
The one thing that makes me think is that you say it's about hundreds of tables AND end-of-year processing. That's normally the time when system resources are most scarce.
If the hash doesn't get the required memory, the job will fail. I'm not 100% sure that a defined memsize of 2GB means you have this amount of memory guaranteed; I hope someone else can shed some light on this.
I/O and disk space are normally also in high demand at the end of the year, but the effect of shared I/O is normally only a decrease in performance, not job failure. So if you want to be on the very safe side, using a sort may be better for your situation. Performance will likely be worse due to increased I/O, especially for tables with big records (lots of variables or very long variables).
Below is a code option for such an approach. The sample code actually performs quite well, but the record size is also quite low (only one additional 8-byte variable).
options fullstimer;
data have(keep=identifier declaration value);
attrib identifier length=8 format=best32. declaration length=$9.;
do identifier=1 to 10000000;
_stop=ceil(ranuni(1)*10);
do _i=1 to _stop;
declaration=cats(identifier,byte(_i+64));
value+1;
output;
end;
end;
run;
proc sql;
create view v_have as
select *
from have
order by identifier, declaration
;
quit;
data want;
set v_have;
by identifier declaration;
if last.identifier;
run;
There is something I don't understand.
You have an annually updated table.
What are the other "hundreds" of programs you wish to run? It sounded like they were equal/similar.
If this is the "one" type of query you wish to optimize, just store the table sorted by identifier and declaration. Then you just need to do simple table scans using SET with BY and last. (or first., depending on the table sort order) logic; see the sketch below.
This sounds too simple, so I presume that I missed something in this conversation?
To speed up the table scans, try storing the table in an SPDE libname, using the fastest disks you can get.
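A minimal sketch of that idea (the SPDE path is a placeholder, and have is assumed to be permanently stored sorted by identifier and declaration):
libname fast spde 'D:\sasdata\spde'; /* placeholder path for the SPDE library */
data fast.want;
  set fast.have; /* stored sorted by identifier, declaration */
  by identifier declaration;
  if last.identifier; /* keeps the declaration with the highest letter */
run;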
There are others who know the syntax better than I do, but the best approach is likely to be a hash table. Load the data into a hash table, using REPLACE for duplicate key values. Then unload the hash table.
All assuming that you have sufficient memory. Good luck.
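A rough sketch of that idea (illustrative only; the conditional replace makes it work even if records within an identifier are not in order):
data _null_;
  if 0 then set have; /* make identifier and declaration known to the hash */
  length _curdecl $15; /* assumed declaration length */
  dcl hash h (ordered:'a');
  h.defineKey('identifier');
  h.defineData('identifier','declaration');
  h.defineDone();
  do until (_done);
    set have end=_done;
    _curdecl = declaration; /* remember this record's declaration */
    if h.find() ne 0 or declaration < _curdecl then do;
      declaration = _curdecl;
      h.replace(); /* keep the highest declaration per identifier */
    end;
  end;
  h.output(dataset:'want'); /* unload the hash to the result dataset */
  stop;
run;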
Regardless of which approach you end up using, I don't think your logic is correct, and you don't need to write the third file. Aren't you really just trying to accomplish something like this?
proc sort data=have out=want;
  by identifier descending declaration; /*<-I changed this statement*/
run;
proc sort data=want nodupkey;
  by identifier; /*<-I changed this statement*/
run;