BookmarkSubscribeRSS Feed
glcoolj12
Obsidian | Level 7

Hello,

 

I have a list of 17,888 unique ids (IDVAR) from a very large dataset (HAVE1) that I would like to split into several macro variables. I'd like to query based on these unique ids in another very large dataset (HAVE2). I define "large dataset" as having over 200 millions rows. Creating a sub-query would take hours to run which is why I'm creating the macro. I've used the code below when I have fewer unique IDs, which runs beautifully. But now I run into the issue of exceed the maximum macro limit. 

  

PROC SQL NOPRINT;
	SELECT DISTINCT QUOTE(TRIM(idvar))
		INTO :LIST1 SEPARATED BY ","
			FROM have1;
QUIT;

%PUT LIST1= &LIST1;


PROC SQL;
	CREATE TABLE want AS 
		SELECT DISTINCT *,
				FROM have2
					WHERE  idvar IN (&LIST1);
QUIT;

 

I've attempted to use the below code, but I run into issues where the unique IDs get cut-off and additional wording gets input to the list.

 

%let n_per_list=4000 ;

data _null_;
  length idlist $32000;
  length macrolist $1000;
  retain macrolist;
  do i=1 to &n_per_list until (eof);
    set HAVE1 end=eof;
    idlist=catx(',',idlist,QUOTE(TRIM(IDVAR)));
  end;
  listno+1;
  call symputx(cats('paralist',listno),idlist);
  macrolist=catx(',',macrolist,cats('&','paralist',listno));
  call symputx('paralist',macrolist);
run;

%put Paralist=%superq(ParaList);
Paralist=&paralist1,&paralist2,&paralist3,&paralist4,&paralist5
%put &=Paralist;


Any assistance that explains how I can properly split up a long list of IDs into several macro variables, will be much appreciated.

 

 

4 REPLIES 4
Kurt_Bremser
Super User
Data belongs in datasets, not in macro variables.
You can create a format for your ~20000 values, or read them into a hash object, if you want to avoid a simple sub-select.
Mind that there may be other reasons for the bad performance of your sub-select attempt.
Tom
Super User Tom
Super User

Just add a test of the length in the DO loop.

Here is example using names from SASHELP.CLASS and stopping with each individual list is more than 20 characters long.

data have1 ;
 set sashelp.class ;
 rename name=idvar;
run;
data _null_;
  length idlist $32000;
  length macrolist $32000;
  retain macrolist;
  do i=1 by 1 until (eof or length(idlist)>20);
    set have1 end=eof;
    idlist=catx(',',idlist,QUOTE(TRIM(IDVAR)));
  end;
  listno+1;
  call symputx(cats('paralist',listno),idlist);
  macrolist=catx(',',macrolist,cats('&','paralist',listno));
  if eof then call symputx('paralist',macrolist);
run;

%put paralist = %superq(paralist);
%put &=paralist;
39    %put paralist = %superq(paralist);
paralist = &paralist1,&paralist2,&paralist3,&paralist4,&paralist5,&paralist6,&paralist7
40    %put &=paralist;
PARALIST="Alfred","Alice","Barbara","Carol","Henry","James","Jane","Janet","Jeffrey","John","Joyce","Judy",
"Louise","Mary","Philip","Robert","Ronald","Thomas","William"

 

ErikLund_Jensen
Rhodochrosite | Level 12

Hi @glcoolj12 

 

A join would normally be much faster. You could try this instead:

 

proc sql; 
	create table want as
		select have2.* 
		from have2 inner join have1
		on have2.idvar = have1.idvar;
quit;

If it is still too slow, it might be a good idea to create an index on have2, but the outcome is difficult to predict. 

ScottBass
Rhodochrosite | Level 12

In general, macro is orders of magnitude slower than data step, SQL, or RDBMS joins.  Especially if you're writing huge amounts of data to the macro symbol table.

 

Since your data is already in a dataset or table, just use it from there.

 

Following on from @ErikLund_Jensen 's post, I ran the below code as a simple test to see which would run faster:

 

data have;
   do x=1 to 1E7;
      random=ranuni(0);
      output;
   end;
run;

data list;
   do y=1 to 1E8 by 3;
      random=ranuni(0);
      output;
   end;
run;

proc sort data=have;
   by random;
run;
proc sort data=list;
   by random;
run;

%bench(start) 
proc sql _method;
   create table want1 as
   select x
   from   have
   where  x in (
      select y from list
   )
   order by x;
quit;
%bench(end)

%bench(start) 
proc sql _method;
   create table want2 as
   select x
   from   have
   join   list
   on     have.x=list.y
   order by x;
quit;
%bench(end)

want1:  Total time: 00 hours, 02 minutes, 08 seconds 127.50099992752

want2:  Total time: 00 hours, 00 minutes, 41 seconds 41.3870000839233

 

SQL execution methods:

 

want1:

 

      sqxcrta
          sqxsort
              sqxfil
                  sqxsrc( WORK.HAVE )

NOTE: SQL subquery execution methods chosen are:

          sqxsubq
              sqxsrc( WORK.LIST )

want2:

 

Because the data was so narrow and would fit in memory, SQL used a hash join:

 

NOTE: SQL execution methods chosen are:

      sqxcrta
          sqxsort
              sqxjhsh
                  sqxsrc( WORK.LIST )
                  sqxsrc( WORK.HAVE )

Hope this helps!


Please post your question as a self-contained data step in the form of "have" (source) and "want" (desired results).
I won't contribute to your post if I can't cut-and-paste your syntactically correct code into SAS.

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 4 replies
  • 1211 views
  • 1 like
  • 5 in conversation