SAS Programming

Luke3 · Posted 03-03-2020 03:49 AM

Hi, I have two datasets: a dataset of codes and a dataset of people. I need to read the dataset of codes and, for each code, scan the dataset of people and assign it to some people according to some criteria. For example:

- read code 1

- assing it to John, Mark and Carla

- read code 2

- assign it to Peter, Rita and Katia

- read code 3

- assign it to Dave, Sue and Carl

... and so on for all codes

How to do it?

EDIT: Please notice that the accepted solution must be modified as I wrote in my answer to it

RichardDeVen · Posted 03-05-2020 11:26 AM

that match L1.gender, L1.ageclass,

indicates join criteria of a group based on gender and ageclass.

You can create views that compute a pair number, based on sequential order within gender/ageclass, for use in a SQL join.

data one_v;
 set one;
 by gender ageclass notsorted;
 if first.ageclass then pair=0; else pair+1;
run;

data two_v;
  set two;
  by gender ageclass notsorted;
  if first.gender then seq=0; else seq+1;
  pair = floor(seq/2);
  rowseq + 1;
run;

proc sql;
  create table want as
  select two.subset, two.gender, two.code, two.ageclass, one.code as assigned_code 
  from two_v as two
  left join one_v as one
  on one.gender=two.gender 
   & one.ageclass=two.ageclass 
   & one.pair=two.pair
  order by rowseq
  ;

View solution in original post

PeterClemmensen · Posted 03-03-2020 03:51 AM

@Luke3 Hi and welcome to the SAS Community 🙂

Are you able to provide some data that resembles your actual data? And what you want the desired result to look like? Makes it much easier to provide usable code for you.

The DATA to DATA Step Macro
Blog: SASnrd

Luke3 · Posted 03-03-2020 04:21 AM

Dataset1 :

subset gender code ageclass

1 M 001 20-29

1 M 002 20-29

1 F 003 30-39

Dataset2 :

subset gender code ageclass

2 M 004 20-29

2 M 005 20-29

2 M 006 20-29

2 M 007 20-29

2 F 008 30-39

2 F 009 30-39

2 F 010 30-39

2 F 011 20-29

I expect as a result Dataset2 (or a new dataset) modified as follows:

Dataset2 :

subset gender code ageclass assigned_code

2 M 004 20-29 001

2 M 005 20-29 001

2 M 006 20-29 002

2 M 007 20-29 002

2 F 008 30-39 003

2 F 009 30-39 003

2 F 010 30-39

2 F 011 20-29

The algorithm must do this:

- read the first line from dataset1 (lets call it L1)

- look for the first two lines of dataset2 that match L1.gender, L1.ageclass, have assiged_code empty, and assign L1.code to assigned_code

- read the second line from dataset1 (lets call it L2)

- look for the first two lines of dataset2 that match L2.gender, L2.ageclass, have assiged_code empty, and assign L2.code to assigned_code

and so on for all the lines of dataset1. The subset column just identifies the dataset.

Astounding · Posted 03-03-2020 11:07 AM

Unfortunately, all I can come up with is cumbersome. But it should work.

Split the first data set into two (or more) data sets, depending on the maximum number of observations that might have the same identifiers.

proc sort data=dataset1;
   by gender ageclass;
run;

data subset1 subset2;
   set dataset1;
   by gender ageclass;
   if first.ageclass then output subset1;
   else output subset2;
   rename code = assigned_code;
   drop subset;  /* not clear why it is in there in the first place */
run;

Assuming that each subset now contains a unique observation for each GENDER AGECLASS combination, combine each with DATASET2.

proc sort data=dataset2;
   by gender ageclass;
run;

data want1 remainder1;
   merge subset1 dataset2 (in=keepme);
   by gender ageclass;
   if keepme;
   if first.ageclass then counter=1;
   else counter + 1;
   if counter <= 2 then output want1;
   else output remainder1;   drop counter;
run;

That gives you 2 observations for the first set of DATASET1 values, plus a second data set with all the DATASET2 observations not yet selected.

Just reapply the same logic for the second set of DATASET1 values:

data want2 remainder2;
   merge subset2 remainder1 (in=keepme);
   by gender ageclass;
   if keepme;
   if first.ageclass then counter = 1;
   else counter + 1;
   if counter <= 2 then output want2;
   else output remainder2;
   drop counter;
run;

If you would like, put the outputs together:

data want;
   set want1 want2 remainder2;
   by gender ageclass;
run;

It's untested code, but has the right ideas incorporated. It will be easy enough to fix if necessary.

Luke3 · Posted 03-06-2020 06:06 AM

Hi Astounding,

I tested your code, but it does the wrong thing. It only keeps distinct combinations of gender, ageclass from dataset1, but no, we must keep all lines of dataset1 and assign their code to first two (or n) lines of dataset2 that match gender, ageclass and don't have an assigned code yet. I guess we can't use merge with duplicates, can we?

Astounding · Posted 03-06-2020 07:08 PM

Glad to hear that you actually tested it. You would be amazed at the number of posters who don't test a proposed solution, yet decide that it will or won't work even though they don't know what the answer should be. At any rate ...

The idea is supposed to be that you split out DATASET1 into unique observations. When there are duplicates, the first of the duplicates goes into SUBSET1. Then the second of the duplicates goes into SUBSET2. When you combine SUBSET1 with DATASET2, it is true that you don't get all the combinations you need. But you do get some of them. Then when you take the remaining observations that have not yet been selected, and combine them with SUBSET2, you get the rest of the combinations.

The example assumes that two subsets are enough, that you never have more than 2 observations for the same combination of GENDER and AGECLASS.

Are we on the same page?

Patrick · Posted 03-06-2020 09:58 PM

Below logic works with your sample data.

data ds1;
  input (subset gender code ageclass) ($);
  datalines;
1 M 001 20-29
1 M 002 20-29
1 F 003 30-39
;

data ds2;
  input (subset gender code ageclass) ($);
  datalines;
2 M 004 20-29
2 M 005 20-29
2 M 006 20-29
2 M 007 20-29
2 F 008 30-39
2 F 009 30-39
2 F 010 30-39
2 F 011 20-29
;

proc sort data=ds1;  
  by gender ageclass code;
run;
data ds1_n;
  set ds1;
  keep gender ageclass code row_ind;
  rename code=assigned_code;
  row_ind=1;
  output;
  row_ind=2;
  output;
run;

proc sort data=ds2;  
  by gender ageclass code;
run;

data want;
  merge ds2 ds1_n;
  by gender ageclass;
  if row_ind=lag(row_ind) then call missing(assigned_code);
  drop row_ind;
run;

proc sort data=want;
  by code;
run;
proc print data=want;
run;

RichardDeVen · Posted 03-05-2020 11:26 AM

that match L1.gender, L1.ageclass,

indicates join criteria of a group based on gender and ageclass.

You can create views that compute a pair number, based on sequential order within gender/ageclass, for use in a SQL join.

data one_v;
 set one;
 by gender ageclass notsorted;
 if first.ageclass then pair=0; else pair+1;
run;

data two_v;
  set two;
  by gender ageclass notsorted;
  if first.gender then seq=0; else seq+1;
  pair = floor(seq/2);
  rowseq + 1;
run;

proc sql;
  create table want as
  select two.subset, two.gender, two.code, two.ageclass, one.code as assigned_code 
  from two_v as two
  left join one_v as one
  on one.gender=two.gender 
   & one.ageclass=two.ageclass 
   & one.pair=two.pair
  order by rowseq
  ;

Luke3 · Posted 03-11-2020 10:31 AM

Hi RichardADeVenezia,

I tested your code and it works, but with corrections. Datasets need to be sorted and the sequence in set 2 must start over for each group gender,ageclass. So the correct code is:

proc sort data=one;
 by gender ageclass;
run;

proc sort data=two;
 by gender ageclass
run;

data one_v;
 set one;
 by gender ageclass;
 if first.ageclass then pair=0; else pair+1;
run;

data two_v;
  set two;
  by gender ageclass;
  if first.ageClass then seq=0; else seq+1;
  pair = floor(seq/2);
  rowseq + 1;
run;

proc sql;
  create table want as
  select two.subset, two.gender, two.code, two.ageclass, one.code as assigned_code 
  from two_v as two
  left join one_v as one
  on one.gender=two.gender 
   & one.ageclass=two.ageclass 
   & one.pair=two.pair
  order by rowseq
  ;

I will mark your post as the solution but readers mind to read this post too because without the corrections the algorithm doesn't work.

I'll use this post to share a solution I elaborated myself and hopefully to get your comments on it:

/* inner loop, on dataset two */
%macro loop_two(code_v,gender_v,ageclass_v);
    data two (drop = counter);
        set two;
		length assigned_code $25;
		if _n_ = 1 then 
			counter = 0;
		 if (counter<4 & assigned_code = ' ' & gender = "&gender_v" & ageclass = "&ageclass_v") then
			do;	
				assigned_code = "&code_v";
				counter+1;
			end;
    run;
%mend;

/* outer loop, on dataset one */
data _null_;
	set one;
	call execute('%loop_two('||code||','||gender||','||ageclass||')'  );
run;

It works but it's slow, because it calls a macro for each observation in dataset one. I wonder if there is a faster way to do it with a double loop... A way to build an array in memory with a single read from dataset one, maybe with macro arrays? But macro arrays can only get a single column of a dataset, they can't take the whole dataset as a matrix, can they?

Tom · Posted 03-11-2020 10:59 AM

Yikes. I haven't really been able to follow what you are trying to do but my advice is to use the code that Richard has given you.

If you want to code in SAS like it was a low level language where you have explicitly move from observation to observation you can, but in that case you should avoid using the macro processor. Just write actual SAS code to do what you want.

Patrick · Posted 03-06-2020 10:19 PM

Or here another approach which returns the data as per your sample.

data ds1;
  input (subset gender code ageclass) ($);
  datalines;
1 M 001 20-29
1 M 002 20-29
1 F 003 30-39
;

data ds2;
  input (subset gender code ageclass) ($);
  datalines;
2 M 004 20-29
2 M 005 20-29
2 M 006 20-29
2 M 007 20-29
2 F 008 30-39
2 F 009 30-39
2 F 010 30-39
2 F 011 20-29
;

data want;

  if _n_=1 then
    do;
      if 0 then set ds2 ds1(keep=code rename=(code=assigned_code));
      dcl hash h1(multidata:'y');
      h1.defineKey('gender','ageclass');
      h1.defineData('assigned_code');
      h1.defineDone();
      do until(last);
        set ds1(keep=gender ageclass code rename=(code=assigned_code)) end=last;
        h1.add();
        h1.add();
      end;
    end;

  call missing(assigned_code);

  set ds2;
  if h1.find()=0 then h1.removedup();

run;

proc print data=want;
run;

Or here a variation for using the hash object. That's my personal favorite not only because it doesn't require to duplicate rows loaded into the hash but also because it would easily allow to use the matching row in the hash as many time as you like (so 2, 3, 4 times etc.)

data want(drop=_:);

  if _n_=1 then
    do;
      if 0 then set ds2 ds1(keep=code rename=(code=assigned_code));
      retain _i 0;
      dcl hash h1(multidata:'y', suminc:'_i',
          dataset:'ds1(keep=gender ageclass code rename=(code=assigned_code))');
      h1.defineKey('gender','ageclass');
      h1.defineData('assigned_code');
      h1.defineDone();
      _i=1;
    end;

  call missing(assigned_code);

  set ds2;
  if h1.find()=0 then
    do;
      h1.sumdup(sum: _sum_i);
      if _sum_i>=2 then h1.removedup();
    end;

run;

Luke3 · Posted 03-16-2020 10:42 AM

Hi Patrick, I tested your codes, all three, and they work. Great to learn about the hash objects, which I didn't know existed.

s_lassen · Posted 03-03-2020 04:05 AM

The easy way may be to sort and merge, e.g.:

proc sort data=people;
  by crit1 crit2 crit3;
run;

proc sort data=codes;
  by crit1 crit2 crit3;
run;

data want;
  merge people(in=OK) codes(in=found);
  if OK;
  if not found then error 'Criteria not found in codes';
run;

One alternative possibility is to put an index on the criteria variables in the codes table:

proc sql;
  create index idx on codes(crit1,crit2,crit3);
quit;

You can then use that to find the codes your want:

data want;
  set people;
  set codes key=idx/unique;
  if _IORC_ then do; 
    put 'Criteria not found in codes';
    call missing(code);
    end;
run;

The UNIQUE option on SET with KEY= makes the lookup reread the data if there are two equal sets of criteria after each other. I put in the code to set CODE missing, because variables read from the lookup dataset are retained; if a key (set of criteria) is not found, the CODE value is not set to missing automatically.

Luke3 · Posted 03-03-2020 04:39 AM

Hi, this solution would perform a join on matching fields, so it wouldn't work as the algorithm I better explained above, would it? Criteria variables are not unique and I need to match only a given number of poeple per code, not all (see algorithm above).

ballardw · Posted 03-03-2020 11:14 AM

@Luke3 wrote:

Hi, this solution would perform a join on matching fields, so it wouldn't work as the algorithm I better explained above, would it? Criteria variables are not unique and I need to match only a given number of poeple per code, not all (see algorithm above).

Provide example data with the "people". They can be dummies by your example data does not include anything related to specific people.

Or if so you have hidden it.

So show a complete worked example identifying the "people" aspect.

SAS Programming

How to read a dataset while looping on another

Re: How to read a dataset while looping on another

Re: How to read a dataset while looping on another

Re: How to read a dataset while looping on another

Re: How to read a dataset while looping on another

Re: How to read a dataset while looping on another

Re: How to read a dataset while looping on another

Re: How to read a dataset while looping on another

Re: How to read a dataset while looping on another

Re: How to read a dataset while looping on another

Re: How to read a dataset while looping on another

Re: How to read a dataset while looping on another

Re: How to read a dataset while looping on another

Re: How to read a dataset while looping on another

Re: How to read a dataset while looping on another

Re: How to read a dataset while looping on another

Follow Us

What is...

SAS Programming

Join us for our biggest event of the year!

SAS Training: Just a Click Away

Follow Us

What is...