BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
Luke3
Obsidian | Level 7

Hi, I have two datasets: a dataset of codes and a dataset of people. I need to read the dataset of codes and, for each code, scan the dataset of people and assign it to some people according to some criteria. For example:

 

- read code 1

    - assing it to John, Mark and Carla

- read code 2

    - assign it to Peter, Rita and Katia

- read code 3

    - assign it to Dave, Sue and Carl

... and so on for all codes

 

How to do it?

 

EDIT: Please notice that the accepted solution must be modified as I wrote in my answer to it

1 ACCEPTED SOLUTION

Accepted Solutions
RichardDeVen
Barite | Level 11

that match L1.gender, L1.ageclass,

indicates join criteria of a group based on gender and ageclass.

 

You can create views that compute a pair number, based on sequential order within gender/ageclass, for use in a SQL join.

 

data one_v;
 set one;
 by gender ageclass notsorted;
 if first.ageclass then pair=0; else pair+1;
run;

data two_v;
  set two;
  by gender ageclass notsorted;
  if first.gender then seq=0; else seq+1;
  pair = floor(seq/2);
  rowseq + 1;
run;

proc sql;
  create table want as
  select two.subset, two.gender, two.code, two.ageclass, one.code as assigned_code 
  from two_v as two
  left join one_v as one
  on one.gender=two.gender 
   & one.ageclass=two.ageclass 
   & one.pair=two.pair
  order by rowseq
  ;

View solution in original post

16 REPLIES 16
PeterClemmensen
Tourmaline | Level 20

@Luke3 Hi and welcome to the SAS Community 🙂

 

Are you able to provide some data that resembles your actual data? And what you want the desired result to look like? Makes it much easier to provide usable code for you.

Luke3
Obsidian | Level 7

Dataset1 :

subset gender code ageclass

1            M        001   20-29

1            M        002   20-29

1            F         003   30-39

 

Dataset2 :

subset gender code ageclass

2            M        004   20-29

2            M        005   20-29

2            M        006   20-29

2            M        007   20-29

2            F        008   30-39

2            F        009   30-39

2            F        010   30-39

2            F        011   20-29

 

I expect as a result Dataset2 (or a new dataset) modified as follows:

 

Dataset2 :

subset gender code ageclass assigned_code

2            M        004   20-29     001

2            M        005   20-29     001

2            M        006   20-29     002

2            M        007   20-29     002

2            F        008   30-39     003

2            F        009   30-39     003

2            F        010   30-39

2            F        011   20-29

 

The algorithm must do this:

 

- read the first line from dataset1 (lets call it L1)

- look for the first two lines of dataset2 that match L1.gender, L1.ageclass, have assiged_code empty, and assign L1.code to assigned_code

 

- read the second line from dataset1 (lets call it L2)

- look for the first two lines of dataset2 that match L2.gender, L2.ageclass, have assiged_code empty, and assign L2.code to assigned_code

 

and so on for all the lines of dataset1. The subset column just identifies the dataset.

 

 

Astounding
PROC Star

Unfortunately, all I can come up with is cumbersome.  But it should work.

 

Split the first data set into two (or more) data sets, depending on the maximum number of observations that might have the same identifiers.

 

proc sort data=dataset1;
   by gender ageclass;
run;

data subset1 subset2;
   set dataset1;
   by gender ageclass;
   if first.ageclass then output subset1;
   else output subset2;
   rename code = assigned_code;
   drop subset;  /* not clear why it is in there in the first place */
run;

Assuming that each subset now contains a unique observation for each GENDER AGECLASS combination, combine each with DATASET2.

 

proc sort data=dataset2;
   by gender ageclass;
run;

data want1 remainder1;
   merge subset1 dataset2 (in=keepme);
   by gender ageclass;
   if keepme;
   if first.ageclass then counter=1;
   else counter + 1;
   if counter <= 2 then output want1;
   else output remainder1;
drop counter; run;

That gives you 2 observations for the first set of DATASET1 values, plus a second data set with all the DATASET2 observations not yet selected.

 

Just reapply the same logic for the second set of DATASET1 values:

data want2 remainder2;
   merge subset2 remainder1 (in=keepme);
   by gender ageclass;
   if keepme;
   if first.ageclass then counter = 1;
   else counter + 1;
   if counter <= 2 then output want2;
   else output remainder2;
   drop counter;
run;

If you would like, put the outputs together:

data want;
   set want1 want2 remainder2;
   by gender ageclass;
run;

It's untested code, but has the right ideas incorporated.  It will be easy enough to fix if necessary.

Luke3
Obsidian | Level 7

Hi Astounding,

 

I tested your code, but it does the wrong thing. It only keeps distinct combinations of gender, ageclass from dataset1, but no, we must keep all lines of dataset1 and assign their code to first two (or n) lines of dataset2 that match gender, ageclass and don't have an assigned code yet. I guess we can't use merge with duplicates, can we?

Astounding
PROC Star

Glad to hear that you actually tested it.  You would be amazed at the number of posters who don't test a proposed solution, yet decide that it will or won't work even though they don't know what the answer should be.  At any rate ...

 

The idea is supposed to be that you split out DATASET1 into unique observations.  When there are duplicates, the first of the duplicates goes into SUBSET1.  Then the second of the duplicates goes into SUBSET2.  When you combine SUBSET1 with DATASET2, it is true that you don't get all the combinations you need.  But you do get some of them.  Then when you take the remaining observations that have not yet been selected, and combine them with SUBSET2, you get the rest of the combinations.

 

The example assumes that two subsets are enough, that you never have more than 2 observations for the same combination of GENDER and AGECLASS.

 

Are we on the same page?

Patrick
Opal | Level 21

Below logic works with your sample data.

data ds1;
  input (subset gender code ageclass) ($);
  datalines;
1 M 001 20-29
1 M 002 20-29
1 F 003 30-39
;

data ds2;
  input (subset gender code ageclass) ($);
  datalines;
2 M 004 20-29
2 M 005 20-29
2 M 006 20-29
2 M 007 20-29
2 F 008 30-39
2 F 009 30-39
2 F 010 30-39
2 F 011 20-29
;

proc sort data=ds1;  
  by gender ageclass code;
run;
data ds1_n;
  set ds1;
  keep gender ageclass code row_ind;
  rename code=assigned_code;
  row_ind=1;
  output;
  row_ind=2;
  output;
run;

proc sort data=ds2;  
  by gender ageclass code;
run;

data want;
  merge ds2 ds1_n;
  by gender ageclass;
  if row_ind=lag(row_ind) then call missing(assigned_code);
  drop row_ind;
run;

proc sort data=want;
  by code;
run;
proc print data=want;
run;

Capture.JPG

RichardDeVen
Barite | Level 11

that match L1.gender, L1.ageclass,

indicates join criteria of a group based on gender and ageclass.

 

You can create views that compute a pair number, based on sequential order within gender/ageclass, for use in a SQL join.

 

data one_v;
 set one;
 by gender ageclass notsorted;
 if first.ageclass then pair=0; else pair+1;
run;

data two_v;
  set two;
  by gender ageclass notsorted;
  if first.gender then seq=0; else seq+1;
  pair = floor(seq/2);
  rowseq + 1;
run;

proc sql;
  create table want as
  select two.subset, two.gender, two.code, two.ageclass, one.code as assigned_code 
  from two_v as two
  left join one_v as one
  on one.gender=two.gender 
   & one.ageclass=two.ageclass 
   & one.pair=two.pair
  order by rowseq
  ;
Luke3
Obsidian | Level 7

Hi RichardADeVenezia,

 

I tested your code and it works, but with corrections. Datasets need to be sorted and the sequence in set 2 must start over for each group gender,ageclass. So the correct code is:

 

proc sort data=one;
 by gender ageclass;
run;

proc sort data=two;
 by gender ageclass
run;

data one_v;
 set one;
 by gender ageclass;
 if first.ageclass then pair=0; else pair+1;
run;

data two_v;
  set two;
  by gender ageclass;
  if first.ageClass then seq=0; else seq+1;
  pair = floor(seq/2);
  rowseq + 1;
run;

proc sql;
  create table want as
  select two.subset, two.gender, two.code, two.ageclass, one.code as assigned_code 
  from two_v as two
  left join one_v as one
  on one.gender=two.gender 
   & one.ageclass=two.ageclass 
   & one.pair=two.pair
  order by rowseq
  ;

I will mark your post as the solution but readers mind to read this post too because without the corrections the algorithm doesn't work.

 

I'll use this post to share a solution I elaborated myself and hopefully to get your comments on it:

 

/* inner loop, on dataset two */
%macro loop_two(code_v,gender_v,ageclass_v);
    data two (drop = counter);
        set two;
		length assigned_code $25;
		if _n_ = 1 then 
			counter = 0;
		 if (counter<4 & assigned_code = ' ' & gender = "&gender_v" & ageclass = "&ageclass_v") then
			do;	
				assigned_code = "&code_v";
				counter+1;
			end;
    run;
%mend;

/* outer loop, on dataset one */
data _null_;
	set one;
	call execute('%loop_two('||code||','||gender||','||ageclass||')'  );
run;

It works but it's slow, because it calls a macro for each observation in dataset one. I wonder if there is a faster way to do it with a double loop... A way to build an array in memory with a single read from dataset one, maybe with macro arrays? But macro arrays can only get a single column of a dataset, they can't take the whole dataset as a matrix, can they?

Tom
Super User Tom
Super User

Yikes. I haven't really been able to follow what you are trying to do but my advice is to use the code that Richard has given you.

 

If you want to code in SAS like it was a low level language where you have explicitly move from observation to observation you can, but in that case you should avoid using the macro processor.  Just write actual SAS code to do what you want.

Patrick
Opal | Level 21

Or here another approach which returns the data as per your sample.

data ds1;
  input (subset gender code ageclass) ($);
  datalines;
1 M 001 20-29
1 M 002 20-29
1 F 003 30-39
;

data ds2;
  input (subset gender code ageclass) ($);
  datalines;
2 M 004 20-29
2 M 005 20-29
2 M 006 20-29
2 M 007 20-29
2 F 008 30-39
2 F 009 30-39
2 F 010 30-39
2 F 011 20-29
;

data want;

  if _n_=1 then
    do;
      if 0 then set ds2 ds1(keep=code rename=(code=assigned_code));
      dcl hash h1(multidata:'y');
      h1.defineKey('gender','ageclass');
      h1.defineData('assigned_code');
      h1.defineDone();
      do until(last);
        set ds1(keep=gender ageclass code rename=(code=assigned_code)) end=last;
        h1.add();
        h1.add();
      end;
    end;

  call missing(assigned_code);

  set ds2;
  if h1.find()=0 then h1.removedup();

run;

proc print data=want;
run;

 

Or here a variation for using the hash object. That's my personal favorite not only because it doesn't require to duplicate rows loaded into the hash but also because it would easily allow to use the matching row in the hash as many time as you like (so 2, 3, 4 times etc.)

data want(drop=_:);

  if _n_=1 then
    do;
      if 0 then set ds2 ds1(keep=code rename=(code=assigned_code));
      retain _i 0;
      dcl hash h1(multidata:'y', suminc:'_i',
          dataset:'ds1(keep=gender ageclass code rename=(code=assigned_code))');
      h1.defineKey('gender','ageclass');
      h1.defineData('assigned_code');
      h1.defineDone();
      _i=1;
    end;

  call missing(assigned_code);

  set ds2;
  if h1.find()=0 then
    do;
      h1.sumdup(sum: _sum_i);
      if _sum_i>=2 then h1.removedup();
    end;

run;

 

Luke3
Obsidian | Level 7

Hi Patrick, I tested your codes, all three, and they work. Great to learn about the hash objects, which I didn't know existed.

s_lassen
Meteorite | Level 14

The easy way may be to sort and merge, e.g.:

proc sort data=people;
  by crit1 crit2 crit3;
run;

proc sort data=codes;
  by crit1 crit2 crit3;
run;

data want;
  merge people(in=OK) codes(in=found);
  if OK;
  if not found then error 'Criteria not found in codes';
run;

 

One alternative possibility is to put an index on the criteria variables in the codes table:

proc sql;
  create index idx on codes(crit1,crit2,crit3);
quit;

You can then use that to find the codes your want:

data want;
  set people;
  set codes key=idx/unique;
  if _IORC_ then do; 
    put 'Criteria not found in codes';
    call missing(code);
    end;
run;

The UNIQUE option on SET with KEY= makes the lookup reread the data if there are two equal sets of criteria after each other. I put in the code to set CODE missing, because variables read from the lookup dataset are retained; if a key (set of criteria) is not found, the CODE value is not set to missing automatically.

Luke3
Obsidian | Level 7

Hi, this solution would perform a join on matching fields, so it wouldn't work as the algorithm I better explained above, would it? Criteria variables are not unique and I need to match only a given number of poeple per code, not all (see algorithm above).

ballardw
Super User

@Luke3 wrote:

Hi, this solution would perform a join on matching fields, so it wouldn't work as the algorithm I better explained above, would it? Criteria variables are not unique and I need to match only a given number of poeple per code, not all (see algorithm above).


Provide example data with the "people". They can be dummies by your example data does not include anything related to specific people.

Or if so you have hidden it.

 

So show a complete worked example identifying the "people" aspect.

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 16 replies
  • 2205 views
  • 6 likes
  • 8 in conversation