BookmarkSubscribeRSS Feed
ertr
Quartz | Level 8

Hello everybody,

 

I have sample code as below,

 

Data Have;
Length ID 8 Variable $ 32;
Infile Datalines Missover; 
Input ID Variable;
Datalines;
1 AAA
1 BBB
1 CCC
2 AAA
2 BBB
2 CCC
;

PROC SORT DATA=HAVE(KEEP=Variable ID) OUT=WANT ;
	BY ID;
RUN;
PROC TRANSPOSE DATA=WANT OUT=WANT(drop=_NAME_)
	PREFIX=Test;
	BY ID;
	VAR Variable;
RUN; QUIT;

 But in real life, when I want to do this process with million values, it takes a lot of time so I want your help to maker it faster. As you can see there is  multiplexing problem in my code and I could solve it with PROC SORT and PROC TRANSPOSE. Can somebody help me do it better?

Thanks

5 REPLIES 5
novinosrin
Tourmaline | Level 20

Just a million?or over a million?

 

Or millions?

 

Or 100's of millions or  billion?

 

 

And is it always sets of 3 for each BY group like what your sample suggests?

mkeintz
PROC Star

The only reason you are  sorting is because dataset HAVE is not sorted.  If it's a big data set then a single DATA step using a hash object may be faster than SORT then TRANSPOSE:

 


data template;
  set have (keep=id);
  length test1-test30 $32;
  t=0;
  stop;
run;
data _null_;
  retain max_t;
  if _n_=1 then do;
   if 0 then set template;
   declare hash h (dataset:'template',ordered:'A');
     h.definekey('id');
     h.definedata(all:'Y');
     h.definedone();
  end;
  array tst {*} $32 test:;
  set have  end=end_of_have;
  if h.find()=0 then t=t+1;
  else t=1;
  tst{t}=variable;
  h.replace();
  max_t=max(max_t,t);
  if end_of_have then do;
    want_descriptor=cats('WANT (keep=ID test1-test',max_t,')');
    rc=h.output(dataset:want_descriptor);
  end;
run;
  1. The data set TEMPLATE is made just to avoid a lot of typing when declaring the hash object.   Typing   all:'Y' is a lot simpler than typing 'TEST1','TEST2',.....,'TEST30'.
  2. Put a number of TEST variables in TEMPLATE that is sure to exceed the number of tests for any ID.
  3. Note that the order of test values within an id will be preserved, even if  they are initially separated by other ID's.
  4. The want_descriptor is the means of dynaimically controlling the number of variables in WANT.  It will be the minimum necessary, just like proc transpose.

 

The value here is to save on disk input/output, at the expense of memory for the hash object.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------
ertr
Quartz | Level 8

Thank you for your invaluable response. It seems work, however, I am not familiar with Hash code, can you please explain more deeply(step by step maybe)? @mkeintz @novinosrin  On the other hand, our data sets 50 million rows. On the other hand, are there any alternative methods except from Hash? 

 

Thanks

Ksharp
Super User
Data Have;
Length ID 8 Variable $ 32;
Infile Datalines Missover; 
Input ID Variable;
Datalines;
1 AAA
1 BBB
1 CCC
2 AAA
2 BBB
2 CCC
;

proc sql noprint;
create index id on have;
select max(n) into : n
 from (select count(*) as n from have group by id);
quit;
data want;
 do i=1 by 1 until(last.id);
  set have;
  by id;
  array x{*} $ 80 test1-test%left(&n);
  x{i}=variable;
 end;
 drop i variable;
run;
ballardw
Super User

@ertr wrote:

Hello everybody,

 

I have sample code as below,

 

Data Have;
Length ID 8 Variable $ 32;
Infile Datalines Missover; 
Input ID Variable;
Datalines;
1 AAA
1 BBB
1 CCC
2 AAA
2 BBB
2 CCC
;

PROC SORT DATA=HAVE(KEEP=Variable ID) OUT=WANT ;
	BY ID;
RUN;
PROC TRANSPOSE DATA=WANT OUT=WANT(drop=_NAME_)
	PREFIX=Test;
	BY ID;
	VAR Variable;
RUN; QUIT;

 But in real life, when I want to do this process with million values, it takes a lot of time so I want your help to maker it faster. As you can see there is  multiplexing problem in my code and I could solve it with PROC SORT and PROC TRANSPOSE. Can somebody help me do it better?

Thanks


Your example data implies that the values of variable may repeat for all your ID values (though possibly in a different order). Is that actually the case? If so selecting the distinct values of the variable, transposing and join to the id list might be another approach.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 5 replies
  • 795 views
  • 1 like
  • 5 in conversation