This topic is solved and locked.
Aboiron
Calcite | Level 5

I have a database with an identifier and declarations. Declarations are constructed as the identifier plus a letter: if the identifier is 123456, the declarations would be "123456A", "123456B", and so on.

I would like to select one observation for each identifier: the one whose declaration carries the last letter, which is, of course, not always the same letter.

I assume I can do that with a PROC SORT and then another one with NODUPKEY:

proc sort data=have out=have2;
  by identifier descending declaration;
run;

proc sort data=have2 out=want nodupkey;
  by identifier;
run;

but as I have a relatively large database (tens of millions of observations), I would like to know the best method, in the sense of both best suited and fastest, if there is another one. Ideally something in one step, as for now it takes a lot of time.

Thanks

Edit 1: fixed a typo in the program.

1 ACCEPTED SOLUTION

jakarman
Barite | Level 11

I have taken Patrick's test set and modified it so that it uses the two character variables. I also added/changed the value variable to obspnt, a record pointer.

Once the selection is done, you can use that pointer to relate back to the original record, even using the point= method to access it directly.

I added the log with fullstimer, run under the University Edition. The test set is about 13M records with a dataset size of 500MB (created in ca. 15s).

The goal: getting back the last record of each duplicate group.

The first solution solves it, as commented in the code, by knowing that declarations are clustered and ordered within each identifier. BY processing with the NOTSORTED option is all that is needed, and it runs in about 15 seconds. When the data is not clustered that way, a sort step can fix it, at the cost of some additional time.

The second solution solves it with a hash. Loading the hash orders/sorts the data, so with this approach no separate sorting step is needed. Then, iterating backwards over the first hash, a second hash is filled with the last declaration for each 14-character identifier. This runs in about 1m20s. I had planned to remove keys from the hash object instead, but got a locking error because the iterator was locking its position.

As shown, the data step's own record processing is not used; you could do some of your processing there using those in-memory hashes. About 1.5GB of memory is used.

I first made a mistake on the test dataset: ten times more records (135M). At that size, creating the data took about 10 minutes and the last. processing about 5 minutes; the hash version could not run at all.

The more that is known of your processing and requirements, the better the choices can be tailored to them.

Hundreds of programs/datasets? That sounds like something worth spending some optimization effort on.
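For illustration, a minimal sketch of that point= access (SELECTION is a hypothetical dataset holding the selected obspnt values; the rename avoids a name clash with the obspnt variable stored in test.have):

data original;
  set selection (keep=obspnt rename=(obspnt=_p));  /* hypothetical dataset of selected record numbers */
  set test.have point=_p;                          /* direct (random) access to the original record */
run;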

44         data test.have(keep=identifier declaration obspnt);
45          attrib identifier length=$14  declaration length=$15.;
46          retain obspnt 0;
47          do i=1 to 1000000;  /* no of records */
48            _value=ceil(ranuni(1)*12000000000);
49            identifier=putn(_value,'z14.0');
50            _stop=ceil(ranuni(2)*26 ); /* 26 letters at max; ASCII capital letters start at 0x41 */
51            do _i=1 to _stop;
52             declaration=cats(identifier,byte(64+_i));
53             obspnt+1;
54             output;
55            end;
56          end;
57         run;

NOTE: The data set TEST.HAVE has 13482024 observations and 3 variables.
NOTE: DATA statement used (Total process time):
       real time           14.21 seconds
       user cpu time       0.00 seconds
       system cpu time     15.37 seconds
       memory              379.96k
       OS Memory           25236.00k
       Timestamp           08-10-2014 04:40:07 PM
       Step Count                        42  Switch Count  74
       Page Faults                       0
       Page Reclaims                     68
       Page Swaps                        0
       Voluntary Context Switches        272
       Involuntary Context Switches      90
       Block Input Operations            0
       Block Output Operations           1058064

58
59         /* unique identifier unsorted, but declaration within it is clustered and ordered; want the last */
60         data test.want1 ;
61          set test.have ;
62          by identifier notsorted  ;
63          if last.identifier;
64          run;

NOTE: There were 13482024 observations read from the data set TEST.HAVE.
NOTE: The data set TEST.WANT1 has 1000000 observations and 3 variables.
NOTE: DATA statement used (Total process time):
       real time           15.41 seconds
       user cpu time       0.00 seconds
       system cpu time     15.96 seconds
       memory              562.09k
       OS Memory           25236.00k
       Timestamp           08-10-2014 04:40:23 PM
       Step Count                        43  Switch Count  40
       Page Faults                       0
       Page Reclaims                     39
       Page Swaps                        0
       Voluntary Context Switches        105
       Involuntary Context Switches      245
       Block Input Operations            0
       Block Output Operations           78992

65
66         /* using just declaration to build a hash with the highest letter value */
67         data notwant (drop=_:);
68           set sashelp.class end=last;
69
70           if _n_=1 then do;
71               length _declprvh _declprvt $14 ;
72               if 0 then set test.have(keep=declaration obspnt rename=(declaration=_decl));
73               dcl hash h1 (dataset:'test.have(keep=declaration obspnt rename=(declaration=_decl) obs=max ))', duplicate:'r',
73       !       ordered: 'yes', hashexp:20);
74               dcl hiter h1p('h1');
75               _rc=h1.defineKey('_decl');
76               _rc=h1.defineData('_decl','obspnt');
77               _rc=h1.defineDone();
78               dcl hash h2 ( ordered: 'yes', duplicate:'r', hashexp:20);
79               dcl hiter h2p('h2');
80               _rc=h2.defineKey('_decl');
81               _rc=h2.defineData('_decl','obspnt');
82               _rc=h2.defineDone();
83               call missing(_decl,obspnt);
84
85               _rcp=h1p.last();
86               do while ( not _rcp) ;
87                 _declprvt=substr(_decl,1,14);
88                 If _declprvt not = _declprvh then _rc=h2.add( );
89                 _declprvh=_declprvt;
90                 _rcp=h1p.prev();
91               end;
92               /* create output data set from hash object */
93               _rc = h2.output(dataset:'test.want2(rename=(_decl=declaration))');
94           end;
95
96           /* your program code on the class dataset */
97         run;

NOTE: There were 13482024 observations read from the data set TEST.HAVE.
WARNING: Hash Object DATASET option should be used when specifying the DUPLICATE option.
NOTE: The data set TEST.WANT2 has 1000000 observations and 2 variables.
NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.NOTWANT has 19 observations and 6 variables.
NOTE: DATA statement used (Total process time):
       real time           1:21.50
       user cpu time       30.95 seconds
       system cpu time     47.99 seconds
       memory              1523263.51k
       OS Memory           1547520.00k
       Timestamp           08-10-2014 04:41:44 PM
       Step Count                        44  Switch Count  154
       Page Faults                       0
       Page Reclaims                     98625
       Page Swaps                        0
       Voluntary Context Switches        699
       Involuntary Context Switches      1111
       Block Input Operations            0
       Block Output Operations           47656

---->-- ja karman --<-----


25 REPLIES
jakarman
Barite | Level 11

Nice question. And a lot more is needed to choose a direction:

- What SAS version do you have?

- What system (Windows/Unix/Mainframe), and with what kind of resources (cores, memory, IO)?

- Where is that database? Is it an external one (like Oracle) or SAS-dedicated?

- How fast should it run? (seconds? several minutes?)

Are both identifier and declaration character-based? Could the letter be separated out into its own variable? Is the identifier an integer number of at most 10 digits?
Do you want just that one resulting dataset, or is there more?

---->-- ja karman --<-----
Aboiron
Calcite | Level 5

So in order:

- I have SAS 9.2.

- I am on Windows 7, working on a company cloud which I do not have much info about.

- It is a SAS-dedicated database.

- There are no hard constraints; faster is better, but it is part of a bigger program, which is planned to run annually. The whole process can take a week or two.

- identifier is a fourteen-character variable, declaration a fifteen-character one.

- I can get the letter with a SUBSTR, but I guess you knew that, so I am not sure what you mean about the character that can be separated.

- It is just a part of a big program (not sure that was the question, though).

Thx in advance
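A minimal sketch of that SUBSTR extraction (assuming the letter is always the 15th character of the declaration):

data _null_;
  length declaration $15;
  declaration = '12345678901234A';       /* hypothetical value */
  letter = substr(declaration, 15, 1);   /* the trailing letter */
  put letter=;
run;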

RW9
Diamond | Level 26

Hi,


I don't actually have SAS right now, but my suggestion would be this. First take a distinct list of identifiers, e.g.:

proc sql;
  create table IDS as
  select distinct
         IDENTIFIER,
         max(rank(DECLARATION)) as XX /* just guessing here, i.e. get the max ASCII number of the letter */
  from   YOUR_DATA
  group by IDENTIFIER;
quit;

Then use that to create the new table:

proc sql;
  create table WANT_DATA as
  select *
  from   YOUR_DATA
  where  IDENTIFIER||DECLARATION in (select distinct ... from IDS);
quit;

Don't know how quick it would be, though. Maybe do it in pieces, e.g. one id at a time.

jakarman
Barite | Level 11

Going for those tens of millions (20M? 60M?) of observations and those two variables, we are talking about a record length of 30 bytes (latin1); that is (at 60M) about 1.8GB.
It is big, but with a modern PC charged with 8GB of memory, 8 cores, and 500GB of disk (a quality laptop), it should be an easy fit for in-memory processing.

When you are using SAS on Windows, it will allow you to do that. Please check memsize (and the OS) with: proc options; run;

Sequential processing, reading all the data once, is commonly faster than random IO unless you are using SSDs.
A fast solution could be building up the key as a hash. It will deliver a sorted table in memory. Working backwards with last()/prev() should easily give the highest value (case sensitive) for every declaration; a sketch of that follows below.
SAS(R) 9.2 Language Reference: Dictionary, Fourth Edition. Only that fifteen-character key is needed in the hash, given the quality assurance that the first 14 characters are the identifier.

In that case it would be one load plus one read-back for the selection. Options:

a/ change/update data in a dataset and create a new dataset with output.

b/ modify the hash (deleting the lower values) and do a hash table save.

Both can be combined with some other processing on the data, eliminating other steps.

Just an annual run? That one is also most likely to change annually; maintainability may be more important than speed of processing.
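A minimal sketch of that backward last()/prev() iteration (hypothetical dataset and variable names; HAVE holds the fifteen-character declaration):

data lastdecl (keep=declaration);
  length declaration $15 _previd $14;
  if _n_=1 then do;
    dcl hash h (dataset:'have', ordered:'yes');  /* loading the hash sorts the keys */
    dcl hiter hi ('h');
    h.defineKey('declaration');
    h.defineData('declaration');
    h.defineDone();
  end;
  /* walk the sorted keys backwards: the first key seen per identifier has the highest letter */
  _rc = hi.last();
  do while (_rc = 0);
    if substr(declaration,1,14) ne _previd then output;
    _previd = substr(declaration,1,14);
    _rc = hi.prev();
  end;
  stop;
run;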

    
           

---->-- ja karman --<-----
Aboiron
Calcite | Level 5

To be more precise, I have to do that a hundred times on databases that have between 500,000 and two million observations.

Memsize is 2147483648; not sure where to find the OS.

It will be just an annual run, so of course maintainability is key, but for now I have to run it a lot in order to test some features. Moreover, it is only a small part of my program, which is itself only a small part of the whole program that has to be run each year, so every bit of time gained is still precious.
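One way to check both from within the SAS session:

proc options option=memsize;
run;                            /* writes the MEMSIZE setting to the log */

%put OS: &sysscp (&sysscpl);    /* automatic macro variables naming the operating system */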

Patrick
Opal | Level 21

You still haven't told us whether your data is stored as SAS files or resides in a database (and which one). This information is crucial for coming up with a "best" solution.

The sample code below is for data stored in SAS files (Jaap's option b).

If your data is stored in a database then - depending on the database - using SQL analytic functions could be a very efficient way: ORACLE-BASE - FIRST_VALUE and LAST_VALUE Analytic Functions
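For example, with explicit SQL pass-through (a minimal sketch; the connection details and table name are hypothetical, and the inner query is the database's own SQL):

proc sql;
  connect to oracle (user=usr password=pwd path=db);
  create table want as
  select * from connection to oracle (
    select identifier, declaration
    from (select identifier, declaration,
                 max(declaration) over (partition by identifier) as last_decl
          from your_table)
    where declaration = last_decl
  );
  disconnect from oracle;
quit;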

options fullstimer;

data have(keep=identifier declaration value);
  attrib identifier length=8 format=best32. declaration length=$9.;
  do identifier=1 to 10000000;
    _stop=ceil(ranuni(1)*10);
    do _i=1 to _stop;
      declaration=cats(identifier,byte(_i+64));
      value+1;
      output;
    end;
  end;
run;

data want (drop=_:);
  set have end=last;
  if _n_=1 then
    do;
      /* create host variable _decl for the hash data item */
      if 0 then set have(keep=declaration rename=(declaration=_decl));
      dcl hash h1 (dataset:'have(keep=identifier declaration rename=(declaration=_decl))', hashexp:9);
      _rc=h1.defineKey('identifier');
      _rc=h1.defineData('_decl');
      _rc=h1.defineDone();
    end;
  /* keep the highest declaration per identifier in the hash */
  call missing(_decl);
  _rc=h1.find();
  if declaration>_decl then h1.replace(key:identifier,data:declaration);
  /* after the last record: re-read HAVE and output the matching observations */
  if last then
    do _i=1 to _nobs;
      set have nobs=_nobs point=_i;
      call missing(_decl);
      _rc=h1.find();
      if declaration=_decl then output;
    end;
run;

Aboiron
Calcite | Level 5

I have, at least I think so: "It is a sas-dedicated database".

That is, I have a library pointing to a directory in which my databases are stored as .sas7bdat files.

Patrick
Opal | Level 21

O.K. - then a hash approach will be hard to beat. I've loaded 10M records into the hash, which consumed around 650MB of RAM. So even if your key/data combination consumes more storage space, your volumes are quite a bit lower, and your memory setting indicates that you have up to 2GB at your disposal.

"To be more precise I have to do that a hundred times on databases that have between 500,000 and two million observations."

Then you definitely want to go for code as optimized as possible and then "wrap" it into a SAS macro so that you have to maintain the "complexity" in a single place only.
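A minimal sketch of such a wrapper (hypothetical names; shown with the sort-based variant for brevity):

%macro keep_last(in=, out=);
  /* selection logic maintained in a single place */
  proc sort data=&in. out=_srt;
    by identifier declaration;
  run;
  data &out.;
    set _srt;
    by identifier;
    if last.identifier;   /* keep the declaration with the highest letter */
  run;
%mend keep_last;

%keep_last(in=mylib.have2014, out=mylib.want2014)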

On a side note - and more for Jaap, actually - the code variant with a second data step and a hash with only a key defined (with 2 variables) consumes 800MB of RAM. Not sure why, and this may be the reason why Paul Dorfman mentioned in another discussion that we need to define a single-byte variable as "data" for the hash in such cases.
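A minimal sketch of that trick (hypothetical names; SELECTION holds the selected key pairs, and the $1 variable _d serves as the hash data part instead of the keys):

data want2 (drop=_:);
  length _d $1;                  /* single-byte dummy data variable */
  if _n_=1 then do;
    dcl hash h1 ();
    h1.defineKey('identifier','declaration');
    h1.defineData('_d');         /* instead of repeating the keys as data */
    h1.defineDone();
    /* load the lookup keys manually, since _d does not exist in SELECTION */
    do until (_eof);
      set selection end=_eof;
      _rc = h1.add();
    end;
  end;
  set have;
  if h1.check()=0 then output;
run;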

Patrick
Opal | Level 21

And here is Jaap's option a), I believe. It runs a bit longer and uses a bit more resources, but may be a bit easier to understand/maintain.

options fullstimer;

data have(keep=identifier declaration value);
  attrib identifier length=8 format=best32. declaration length=$9.;
  do identifier=1 to 10000000;
    _stop=ceil(ranuni(1)*10);
    do _i=1 to _stop;
      declaration=cats(identifier,byte(_i+64));
      value+1;
      output;
    end;
  end;
run;

data _null_;
  set have end=last;
  if _n_=1 then
    do;
      if 0 then set have(keep=declaration rename=(declaration=_decl));
      dcl hash h1 (dataset:'have(keep=identifier declaration rename=(declaration=_decl))');
      _rc=h1.defineKey('identifier');
      _rc=h1.defineData('identifier','_decl');
      _rc=h1.defineDone();
    end;
  call missing(_decl);
  _rc=h1.find();
  if declaration>_decl then h1.replace(key:identifier,data:identifier,data:declaration);
  /* write the selected identifier/declaration pairs to a dataset */
  if last then h1.output(dataset:'selection(rename=(_decl=declaration))');
run;

data want;
  set have;
  if _n_=1 then
    do;
      dcl hash h1 (dataset:'selection');
      _rc=h1.defineKey('identifier','declaration');
      _rc=h1.defineDone();
    end;
  if h1.check()=0 then output;
run;

Aboiron
Calcite | Level 5

Thx, I will look at it tomorrow when I'm more focused.

Patrick
Opal | Level 21

The one thing which makes me think is that you say it's about hundreds of tables AND end-of-year processing. That's normally the time when system resources are the most scarce.

If the hash doesn't get the required memory, the job will fail. I'm not 100% sure that a defined memsize of 2GB guarantees you that amount of memory. Hope someone else can shed some light on this.

I/O and disk space are normally also in high demand at the end of the year, but the effect of shared I/O is normally only a decrease in performance, not job failure. So if you want to be on the very safe side, then using a sort may be the better choice for your situation. Performance will likely be worse due to increased I/O - especially for tables with big records (lots of variables or very long variables).

Below is a code option for such an approach. The sample code actually performs quite well - but the record size is also quite low (only one additional 8-byte variable).

options fullstimer;

data have(keep=identifier declaration value);
  attrib identifier length=8 format=best32. declaration length=$9.;
  do identifier=1 to 10000000;
    _stop=ceil(ranuni(1)*10);
    do _i=1 to _stop;
      declaration=cats(identifier,byte(_i+64));
      value+1;
      output;
    end;
  end;
run;

proc sql;
  create view v_have as
    select *
    from have
    order by identifier, declaration
    ;
quit;

data want;
  set v_have;
  by identifier declaration;
  if last.identifier;
run;

LinusH
Tourmaline | Level 20

There is something I don't understand.

You have an annually updated table.

What are the other "hundreds" of programs you wish to run? It sounded like they were equal/similar.

If this is the one type of query you wish to optimize, just store the table in sorted order by identifier and declaration. Then you just need simple table scans using SET with BY and last. (or first., depending on the table sort order) logic.

This sounds too simple, so I presume I missed something in this conversation?

To speed up the table scans, try storing the table in an SPDE libname, using the fastest disks you can get.
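A minimal sketch of that setup (the path is hypothetical):

libname speedy spde 'C:\sasdata\spde';   /* SPD Engine library on the fastest available disks */

proc sort data=have out=speedy.have_sorted;
  by identifier declaration;
run;

data want;
  set speedy.have_sorted;
  by identifier;
  if last.identifier;   /* last declaration per identifier */
run;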

Data never sleeps
Astounding
PROC Star

There are others who know the syntax better than I do, but the best approach is likely to be a hash table.  Load the data into a hash table, using REPLACE for duplicate key values.  Then unload the hash table.

All assuming that you have sufficient memory.  Good luck.
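A minimal sketch of that idea (assuming declarations are ordered ascending within each identifier, so the record read last per identifier carries the highest letter):

data _null_;
  if _n_=1 then do;
    dcl hash h (ordered:'a');    /* output comes back sorted by identifier */
    h.defineKey('identifier');
    h.defineData('identifier','declaration');
    h.defineDone();
  end;
  set have end=_eof;
  _rc = h.replace();             /* duplicate keys: the record read last wins */
  if _eof then _rc = h.output(dataset:'want');
run;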

art297
Opal | Level 21

Regardless of which approach you end up using, I don't think your logic is correct, and you don't need to write the third file. Aren't you really just trying to accomplish something like this?

proc sort data=have out=want;
  by identifier descending declaration; /*<-I changed this statement*/
run;

proc sort data=want nodupkey;
  by identifier; /*<-I changed this statement*/
run;


