Solved: create a macro to iterate through multiple keywords and datasets

river1 · Posted 01-27-2024 02:41 AM

Hello, I have some code to search for keywords in multiple datasets that are created from about 40 merge statements. I've given an example of one of the merge statements and 8 keywords. I have been trying to get the macro to also iterate through three other datasets, for example, LIB2.DATA1, LIB2.DATA2, LIB2.DATA3 and LIB2.DATA4. Although it will take longer to run I'm trying to simplify the output but I've struggled adding in the different datasets. I'd really appreciate some ideas on how to do this.
Many thanks

ods excel file="C:\SAS_output\output_A1.xlsx" options(sheet_name="Sheet1" sheet_interval='none');

%macro SINGLES4(keyword);
data MYLIB1.A1_&keyword ;
  if _n_ = 1 then do;
    dcl hash h1(dataset:'LIB2.DATA1(where=(find(cats(VAR1), "&keyword", "i")))');
    h1.defineKey('VAR2');
    h1.defineDone();
  end;
  set LIB2.DATA1;
  if h1.check() = 0;
run;


title "next search";

proc sort data=MYLIB1.A1_&keyword out=MYLIB1_A1_&keyword;
  by VAR2;
run;

proc sort data=MYLIB4.D4 out=MYLIB4_D4_&keyword;
  by VAR2;
run;

data MYLIB1.A2_&keyword;
  merge MYLIB1_A1_&keyword (in=a) MYLIB4_D4_&keyword (in=b);
  by VAR2;
  if a and b;
  output;
run;

proc sql;
  select count(distinct VAR2) as distinct_A2_&keyword
  from MYLIB1.A2_&keyword;
quit;

[merge and count statements repeated many times]




%mend;

%macro iterate_keywords;
%do i=1 %to 8; 
  %let keyword = %scan(ADF DF QW ER RT TY YU UI, &i); 
  %SINGLES4(&keyword);
%end;
%mend;

%iterate_keywords;

Patrick · Posted 01-27-2024 08:29 PM

Assuming all you want is the result from...

proc sql;
  select count(distinct VAR2) as distinct_A2_&keyword
  from MYLIB1.A2_&keyword;
quit;

...I strongly feel there must be a simpler, better performant and less resource intensive approach.

Your current code will return a count of distinct strings that exist in both tables and match a search term. Is that what you're after?

Is your table MYLIB1.A1_&keyword always the same table where you select rows that match a search term and then compare with one to many other tables?

Are the search terms you are looking for substrings or words in a string or is this the full value stored in the variable you investigate?

I suggest you explain us in sufficient detail what "business problem" you want to solve, what you have and what your desired result is.

Below code is just for illustration how simple things could become depending on your actual requirements.

Below SQL still does a lot of sorting internally so for us to propose something adequate we also need some indication of the data volumes (number of rows in tables) you are dealing with.

/* create sample data */
data work.have_1;
  set sashelp.class;
  if _n_ in (1,3,5,7) then
  output;
  output;
run;

data work.have_2;
  set sashelp.class;
  if _n_ in (1,5,7,9) then
    do;
      output;
      output;
      output;
    end;
run;

%let keyword =ADF DF QW ER RT TY YU UI RE;
data _null_;
  call symputx('keyword2',translate(compbl("&keyword"),'|',' '));
run;

title 'Matching strings sub-set by search terms';
proc sql;
  select name from work.have_1
  where prxmatch("/&keyword2/oi",strip(name))>0
  intersect
  select name from work.have_2
  where prxmatch("/&keyword2/oi",strip(name))>0
  ;
quit;

Above strings exist in both source tables and contain substring RE.

View solution in original post

Kurt_Bremser · Posted 01-27-2024 03:06 AM

This will not work:

dcl hash h1(dataset:'LIB2.DATA1(where=(find(cats(VAR1), "&keyword", "i")))');

The single quotes prevent the resolution of the macro variable. Use double quotes outside and single quotes inside:

dcl hash h1(dataset:"LIB2.DATA1(where=(find(cats(VAR1),'&keyword','i')))");

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

river1 · Posted 01-27-2024 03:14 AM

Thanks for this, I'll change my code.

Would it be possible to extend the macro to also iterate the same code and search for the keywords through multiple datasets?

I am running this so many times as I am also searching for different keywords in another variable plus different variations and it would help to cut down on my pages of code. for example to run the code through LIB2.DATA1 but also through LIB2.DATA2, LIB2.DATA3 and LIB2.DATA4.

Many thanks

Kurt_Bremser · Posted 01-27-2024 03:21 AM

Create a dataset with dataset names.

Use dataset name as a parameter of the outside macro, and as additional parameter of the inside macro.

Replace literal dataset names with the parameter where needed.

Then, call the outside macro from the dataset with CALL EXECUTE.

Don't forget to show the dataset name in the result.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Patrick · Posted 01-27-2024 08:29 PM

Assuming all you want is the result from...

proc sql;
  select count(distinct VAR2) as distinct_A2_&keyword
  from MYLIB1.A2_&keyword;
quit;

...I strongly feel there must be a simpler, better performant and less resource intensive approach.

Your current code will return a count of distinct strings that exist in both tables and match a search term. Is that what you're after?

Is your table MYLIB1.A1_&keyword always the same table where you select rows that match a search term and then compare with one to many other tables?

Are the search terms you are looking for substrings or words in a string or is this the full value stored in the variable you investigate?

I suggest you explain us in sufficient detail what "business problem" you want to solve, what you have and what your desired result is.

Below code is just for illustration how simple things could become depending on your actual requirements.

Below SQL still does a lot of sorting internally so for us to propose something adequate we also need some indication of the data volumes (number of rows in tables) you are dealing with.

/* create sample data */
data work.have_1;
  set sashelp.class;
  if _n_ in (1,3,5,7) then
  output;
  output;
run;

data work.have_2;
  set sashelp.class;
  if _n_ in (1,5,7,9) then
    do;
      output;
      output;
      output;
    end;
run;

%let keyword =ADF DF QW ER RT TY YU UI RE;
data _null_;
  call symputx('keyword2',translate(compbl("&keyword"),'|',' '));
run;

title 'Matching strings sub-set by search terms';
proc sql;
  select name from work.have_1
  where prxmatch("/&keyword2/oi",strip(name))>0
  intersect
  select name from work.have_2
  where prxmatch("/&keyword2/oi",strip(name))>0
  ;
quit;

Above strings exist in both source tables and contain substring RE.

river1 · Posted 01-29-2024 03:49 AM

Thanks for your reply that is helpful. I have a lot of data and perhaps there is a simpler way to deal with this. I will now combine everything into a single dataset rather than the four datasets I had intended and this will reduce the duplication and make it easier for analysis. There are also two smaller datasets with ethnicity /gender/age groups variables and they all have the linking variable VAR2.

I have only been able to use the macro above on the 18 single keywords. I have several other variables that are a combination of keywords (or excluding keywords) so I am extracting these separately. The main dataset (MYLIB2.DATA1) has about 3.5 million rows and it is down to 8 variables but if I was to also merge it with the outputs from the hash objects and the two ethnicity datasets (using the linking variable var2) it would have 60+ variables, which would be large. I’ve not been able to make the hash object add the output to a variable in a single dataset so the best I could do is below. The final merge statement will be a bit messy if there is an easier way to do this it would be helpful. But I will no longer need the merge a and b statement - this was causing a warning, which I think it was because of the multiple observations with the same VAR2 in MYLIB2.DATA1

data TEST.B1 ;
  if _n_= 1 then 
    do;
      declare hash CS (dataset:”MYLIB2.DATA1(where=(
                                   find(cats(VAR1),’XXX’, ‘I’) AND
                                   find(cats(VAR1),’AAA’, ‘I’)
                                   ))”);
      CS.defineKey('VAR2');
      CS.defineDone();
   end;
  set MYLIB.DATA1;
  if CS.check()=0;
run;


proc sql;
    select count(distinct VAR2) as distinct_B1_XA
    from TEST.B1;
quit;


data TEST.B1_1;
  set TEST.B1;
  XA = 1;
run;



data TEST.B2 (compress=yes);
  if _n_= 1 then 
    do;
      dcl hash CS(dataset:"MYLIB.DATA1(where=(
                                   find(cats(VAR1),'AAA', 'i') OR  
                                   find(cats(VAR1),'BBB', 'i')   
                                   ))");
      CS.defineKey('VAR2');
      CS.defineDone();
   end;
  set MYLIB.DATA1;
  if CS.check()=0;
run;


proc sql;
    select count(distinct VAR2) as distinct_B2
    from TEST.B2;
quit;


data TEST.B2_1;
  set TEST.B2;
  AB = 1;
run;



proc sort data=TEST.B2_1 out=TEST_B2_1;
    by VAR2;
run;

proc sort data=TEST.B1_1 out=TEST_B1_1 ;
    by VAR2;
run;

proc sort data=MYLIB.DATA1 out= MYLIB_DATA1;
    by VAR2;
run;


data test.F1;
  merge TEST_B2_1 (in=a) TEST_B1_1 (in=b) MYLIB_DATA1 (in=c);
  by VAR2;
output;

run;


proc delete data=TEST.B1 TEST.B1_1 TEST_B1_1 TEST.B2 TEST.B2_1 TEST_B2_1 MYLIB_DATA1; 
run;

create a macro to iterate through multiple keywords and datasets

Re: create a macro to iterate through multiple keywords and datasets

Re: create a macro to iterate through multiple keywords and datasets

Re: create a macro to iterate through multiple keywords and datasets

Re: create a macro to iterate through multiple keywords and datasets

Re: create a macro to iterate through multiple keywords and datasets

Re: create a macro to iterate through multiple keywords and datasets