Solved: Merging multiple variables with a common dataset

JJIN · Posted 09-21-2018 09:27 AM

Hi,

Would appreciate if anyone could help a beginner like me with the following.

I have a dataset A which has the IDs and names of the bosses (Top_ID, Top_Name) and their subordinates (A1_ID to A4_ID).

data A;
   input Top_ID $ Top_Name $ A1_ID $ A2_ID $ A3_ID $ A4_ID $;
   datalines;
00001 James 12345 12344 12346 12340
00001 James 12345 12347 12348
00001 James 12349
00002 Amy 
00003 Jenny 11111 11112 12345
00003 Jenny 11113
00005 Thomas 10101 10102 11112
00006 Ken 00002
00007 Paul 00005 10101 10102 11112
;`

Then I have dataset B which has the ID of all employees, their years in service and their name.

I would like to merge in the details in B to A while retaining the current structure of A.

data Principal;
   input ID $ Name $ Years $ ;
   datalines;
00001 James 31
12345 Whitney 8
12344 Hannah 2
12346 Lois 1
12340 Mon 5
12347 Wendy
12348 Ara 1
12349 Chen 2
00002 Amy 11
00003 Jenny 5
11111 Oliver 5
11112 Melissa 1
11113 Jamie 1
00005 Thomas 20
10101 Nana 2
10102 Mandy 4
00006 Ken 4
00007 Paul 2
;`

I know it is possible to do using proc sql or data step merge, but is there a faster way if I have a large database with a more complex structure?

Thanks in advance!

FreelanceReinh · Posted 09-23-2018 09:03 AM

@JJIN wrote:

Just one more question. If my Principal dataset now has both numeric and characters, how would that change?

To keep it simple, let's use a variant of dataset Principal in which YEARS is a numeric variable.

As mentioned in my explanations, we can't put character variables (Name1, Name2, ...) and numeric variables (now Year1, Year2, ...) into the same array. However, we can create one array (call it VC [better than CV, which is the name of a SAS function]) for all character variables and one array (call it VN) for all numeric variables. The corresponding lists of variable names, CVARLIST and NVARLIST, are created in the same way as VARLIST and IDLIST (see first data step below). Note that VARLIST is still useful for specifying the order of variables in the PDV.

/* Create lists of variable names (A1_ID, Name1, Year1, A2_ID, ...),
   numeric variables (Year1, Year2, ...), 
   non-ID character variables (Name1, Name2, ...)
   and ID variables (A1_ID, A2_ID, ...) */

data _null_;
length varlist $150 cvarlist nvarlist idlist $50;
if 0 then set a;
array a[*] a1_id--a4_id;
do i=1 to dim(a);
  varlist=catx(' ', varlist, vname(a[i]), cats('Name', i), cats('Year', i));
  cvarlist=catx(' ', cvarlist, cats('Name', i));
  nvarlist=catx(' ', nvarlist, cats('Year', i));
  idlist=catx(' ', idlist, vname(a[i]));
end;
call symputx('varlist', varlist);
call symputx('cvarlist', cvarlist);
call symputx('nvarlist', nvarlist);
call symputx('idlist', idlist);
stop;
run;

/* Populate variables Name1, Year1, Name2, ... with values from PRINCIPAL */

data final(drop=i id name years);
if _n_=1 then do;
  if 0 then set a(keep=Top:);
  retain &varlist;
  array vc[*] $ &cvarlist;
  array vn[*] &nvarlist;
  array a[*] $ &idlist;
  dcl hash h(dataset:'principal');
  h.definekey('id');
  h.definedata('name', 'years');
  h.definedone();
  if 0 then set principal;
end;
set a;
call missing(of vc[*], of vn[*]);
do i=1 to dim(a);
  if a[i] ne ' ' then do;
    if h.find(key:a[i])=0 then do;
      vc[i]=name;
      vn[i]=coalesce(years,0);
    end;
  end;
end;
run;

The structure of the second data step hasn't changed much either. New is the RETAIN statement. Its sole purpose is to ensure the desired order of variables in the PDV (allowing for the mixture of variable types unlike the former array v ... statement). Actually, this is only a side effect of the RETAIN statement, whose main effect is to prevent variables from being initialized to missing at the beginning of each data step iteration. The latter can be dangerous if assignment statements are executed only conditionally, because not executing the assignment statement now means: "retain the value from the previous iteration of the data step" (typically, the value from the previous observation in the dataset). This is exactly what we don't want to happen to our NAMEi and YEARi variables, whose assignments are indeed conditional (see IF conditions). To avoid this, we now use CALL MISSING in each iteration of the data step and explicitly reset these variables to missing.

The RETAIN statement doesn't specify variable lengths and not even variable types. Therefore, the dollar signs (and potentially length specifications like $40) in the definitions of character arrays VC and A are mandatory.

It is one of the strengths of the SAS hash object that it accepts arbitrary combinations of character and numeric variables as key or data items. Hence, no need for code changes regarding hash object H.

The assignment statements in the last part of the data step have become a bit simpler thanks to the split arrays, but with your real data you will have more of them and the array indices will differ if you have more than one group of variables per type. Numeric variable YEARS requires the numeric counterpart of COALESCEC, i.e. COALESCE.

Further remarks:

As you see in the definition of ARRAY VC, all character variables would have the same length. This might be unfortunate (a waste of disk space) with your real data if you have, say, several very short variables (such as Y/N flags), but also long variables for names or addresses. In this case you may want to split the array further and define, e.g., VshortC as $1 and VlongC as $40.
With still more programming effort one could make the program fully data-driven: The number of ID variables (here: 4) and the names, types and lengths of the "data variables" (here: Name, Year) could be retrieved from DICTIONARY.COLUMNS and used dynamically in order to avoid hardcoding.

View solution in original post

LinusH · Posted 09-21-2018 09:34 AM

If your concern is performance, you could try to use hash table look-up.

Data never sleeps

JJIN · Posted 09-22-2018 01:38 AM

I'm not familiar with hash, are you able to help with the codes so I can learn from there?

FreelanceReinh · Posted 09-21-2018 10:43 AM

Hi @JJIN,

What do you want the resulting dataset to look like? Please note that the current structure of A is not ideal: Obviously, the number of subordinates per boss varies so that the dataset is full of missing values.

JJIN · Posted 09-22-2018 01:33 AM

Hi, the resulting dataset should be something like this:

data final;
   input Top_ID $ Top_Name $ A1_ID $ Name1 $ Year1 $ A2_ID $ Name2 $ Year2 $ A3_ID $ Name3 $ Year3 $ A4_ID $ Name4 $ Year4 $;
   datalines;
00001 James 12345 Whitney 8 12344 Hannah  2 12346 Lois  1 12340 Mon 5
00001 James 12345 Whitney 8 12347 Wendy 0 12348 Ara 1
00001 James 12349 Chen 2
00002 Amy 
00003 Jenny 11111 Oliver  5 11112 Melissa  1 12345 Whitney 8
00003 Jenny 11113 Jamie 1
00005 Thomas 10101 Nana  2 10102 Mandy  4 11112 Melissa 1
00006 Ken 00002 Amy 11
00007 Paul 00005 Thomas 20 10101 Nana  2 10102 Mandy  4 11112 Melissa 1
;`

Understand the current structure is not ideal and there are other brute force ways to do it, but I am wondering if there is a more efficient way to do while maintaining the structure of A?

FreelanceReinh · Posted 09-22-2018 08:01 AM

Thanks @JJIN for providing the desired output dataset for the example. Here is how you could obtain it programmatically (using the hash object):

/* Create lists of variable names (A1_ID, Name1, Year1, A2_ID, ...)
   and specifically of ID variables (A1_ID, A2_ID, ...) */

data _null_;
length varlist $150 idlist $50;
if 0 then set a;
array a[*] a1_id--a4_id;
do i=1 to dim(a);
  varlist=catx(' ', varlist, vname(a[i]), cats('Name', i), cats('Year', i));
  idlist=catx(' ', idlist, vname(a[i]));
end;
call symputx('varlist', varlist);
call symputx('idlist', idlist);
stop;
run;

/* Populate variables Name1, Year1, Name2, ... with values from PRINCIPAL */

data final(drop=i id name years);
if _n_=1 then do;
  if 0 then set a(keep=Top:);
  array v[*] $ &varlist;
  array a[*] &idlist;
  dcl hash h(dataset:'principal');
  h.definekey('id');
  h.definedata('name', 'years');
  h.definedone();
  if 0 then set principal;
end;
set a;
do i=1 to dim(a);
  if a[i] ne ' ' then do;
    if h.find(key:a[i])=0 then do;
      v[3*i-1]=name;
      v[3*i]=coalescec(years,'0');
    end;
  end;
end;
run;

Explanations will follow later today (CEST).

Please note that your data steps did not properly create the sample datasets, but with one extra statement after the DATA statement they did:

infile datalines missover;

FreelanceReinh · Posted 09-22-2018 12:41 PM

Explanations of the suggested code, as promised:

The first data step creates two macro variables VARLIST and IDLIST (by means of the CALL SYMPUTX routine), which receive their values from data step character variables of the same names. Data step variable IDLIST, in turn, is simply a concatenation of the ID variable names A1_ID, A2_ID, ..., separated by blanks (see first argument of the CATX function): "A1_ID A2_ID A3_ID A4_ID".

The names could be constructed as concatenations of "A", the number (1, 2, ...) and "_ID", but I used another way to get them: the VNAME function applied to elements of the array A, which I had defined using the variable list a1_id--a4_id based on the variables' positions in the program data vector (PDV). As a prerequisite I loaded the variable names of dataset A into the PDV by means of the SET statement. Since we don't need data from A at this point, but just variable names, this SET statement doesn't need to be executed. We only need its effect at compile time. Hence, "if 0 then set ..." is sufficient.

Please note that with more suitable variable names (with numbers at the end, e.g. a_id1, a_id2, ...) the code could be simplified by using variable lists like a_id1-a_id4 in place of &idlist.

Data step variable VARLIST contains the same names as IDLIST plus the names Name1, Name2, ... and Year1, Year2, ... (constructed using the CATS function) in the order you specified for dataset FINAL.

The lengths of VARLIST and IDLIST (150 and 50, resp.) are more than sufficient for the variable names in the example data, but you might need to increase them for your real data (if there are many more variables or with longer names).

The second data step starts with a block of statements which are executed only once at the very beginning of the step's execution (_n_=1 indicates the first iteration of the data step) or which take effect during compile time already:
As this is the data step to create dataset FINAL, we'd like to ensure that the PDV is created in the desired order. The first SET statement puts variables Top_ID and Top_Name from dataset A into the first two positions in the PDV (which happens at compile time, without reading data). The remaining variables are declared by the first ARRAY statement which uses the list &VARLIST we prepared in the preceding data step. This takes advantage of the fact that all variables listed in &VARLIST are character variables. A mixture of character and numeric variables could not form a single array. For your real data it will be important to define a sufficient length in this ARRAY statement, e.g.

array v[*] $40 &varlist;

if the longest value of the variables in the list had length <=40. Currently, the default length 8 is used.

The second ARRAY statement doesn't contribute anything to the PDV because the variables in &IDLIST are a subset of those in &VARLIST. This is why neither a dollar sign nor a length specification is needed here. Please note that the list a1_id--a4_id (which was fine in the first data step) would now be incorrect because, e.g., Name1 and Year1 occur between A1_ID and A2_ID in the PDV.

ARRAY statements are declarative, so they actually don't belong into a DO-END block, but the declaration esp. of array V directly after the SET statement makes sense in view of the variable order we want to achieve.

Next is the declaration of hash object H, which is a temporary, memory-resident table very similar to dataset PRINCIPAL. The values of variable ID in that dataset serve as unique keys of the hash table (they must not contain duplicates). The information we want to retrieve for an ID are the "data items" of the hash object: They receive their values from variables NAME and YEARS of dataset PRINCIPAL. For the communication between hash object and data step we need these three variables (ID, NAME, YEARS) in the PDV, which is accomplished by the SET statement at the end of the DO-END block. From dataset FINAL, however, they will be excluded by means of the DROP= dataset option.

So, before any observation from dataset A is read, the entire content of dataset PRINCIPAL has been loaded into the hash table (memory permitting).

The central part of this data step is the third SET statement. In each iteration of the data step it reads one observation from dataset A, thus "retaining the current structure of A," as requested.

Once an observation is read, variables A1_ID, ..., A4_ID (alias a[1], ..., a[4]) contain between 0 and 4 IDs of "subordinates". Each of the four ID variables is handled in an iteration of the DO loop (i=1, 2, 3, 4=dim(a)). Empty IDs (there are many of them, as mentioned earlier) can be skipped, hence the IF condition a[i] ne ' '. The corresponding NAMEi and YEARi values will remain missing. For non-empty IDs a search in the hash table is performed using the FIND method of H. If the ID a[i] is found, the FIND method returns value 0 (as a return code) so that the two assignment statements in the inner DO-END block are executed. The major effect of a successful FIND operation is that the data values are retrieved into the corresponding data step variables, NAME and YEARS. And these values are to be stored in the appropriate NAMEi and YEARi variables.

Every third variable in the list

A1_ID Name1 Year1 A2_ID Name2 Year2 A3_ID Name3 Year3 A4_ID Name4 Year4

is a YEARi variable and is preceded by a NAMEi variable. This explains the array indices used, 3*i and 3*i-1, respectively.

I noticed that you assigned YEARS=0 in dataset FINAL in one case where YEARS was missing in dataset PRINCIPAL (ID='12347'). This is why I used the COALESCEC function in the assigment statement populating the YEARi variables.

HTH

JJIN · Posted 09-22-2018 10:18 PM

Thanks @FreelanceReinh for the detailed explanation!

Just one more question. If my Principal dataset now has both numeric and characters, how would that change?

FreelanceReinh · Posted 09-23-2018 09:03 AM

@JJIN wrote:

Just one more question. If my Principal dataset now has both numeric and characters, how would that change?

To keep it simple, let's use a variant of dataset Principal in which YEARS is a numeric variable.

As mentioned in my explanations, we can't put character variables (Name1, Name2, ...) and numeric variables (now Year1, Year2, ...) into the same array. However, we can create one array (call it VC [better than CV, which is the name of a SAS function]) for all character variables and one array (call it VN) for all numeric variables. The corresponding lists of variable names, CVARLIST and NVARLIST, are created in the same way as VARLIST and IDLIST (see first data step below). Note that VARLIST is still useful for specifying the order of variables in the PDV.

/* Create lists of variable names (A1_ID, Name1, Year1, A2_ID, ...),
   numeric variables (Year1, Year2, ...), 
   non-ID character variables (Name1, Name2, ...)
   and ID variables (A1_ID, A2_ID, ...) */

data _null_;
length varlist $150 cvarlist nvarlist idlist $50;
if 0 then set a;
array a[*] a1_id--a4_id;
do i=1 to dim(a);
  varlist=catx(' ', varlist, vname(a[i]), cats('Name', i), cats('Year', i));
  cvarlist=catx(' ', cvarlist, cats('Name', i));
  nvarlist=catx(' ', nvarlist, cats('Year', i));
  idlist=catx(' ', idlist, vname(a[i]));
end;
call symputx('varlist', varlist);
call symputx('cvarlist', cvarlist);
call symputx('nvarlist', nvarlist);
call symputx('idlist', idlist);
stop;
run;

/* Populate variables Name1, Year1, Name2, ... with values from PRINCIPAL */

data final(drop=i id name years);
if _n_=1 then do;
  if 0 then set a(keep=Top:);
  retain &varlist;
  array vc[*] $ &cvarlist;
  array vn[*] &nvarlist;
  array a[*] $ &idlist;
  dcl hash h(dataset:'principal');
  h.definekey('id');
  h.definedata('name', 'years');
  h.definedone();
  if 0 then set principal;
end;
set a;
call missing(of vc[*], of vn[*]);
do i=1 to dim(a);
  if a[i] ne ' ' then do;
    if h.find(key:a[i])=0 then do;
      vc[i]=name;
      vn[i]=coalesce(years,0);
    end;
  end;
end;
run;

The structure of the second data step hasn't changed much either. New is the RETAIN statement. Its sole purpose is to ensure the desired order of variables in the PDV (allowing for the mixture of variable types unlike the former array v ... statement). Actually, this is only a side effect of the RETAIN statement, whose main effect is to prevent variables from being initialized to missing at the beginning of each data step iteration. The latter can be dangerous if assignment statements are executed only conditionally, because not executing the assignment statement now means: "retain the value from the previous iteration of the data step" (typically, the value from the previous observation in the dataset). This is exactly what we don't want to happen to our NAMEi and YEARi variables, whose assignments are indeed conditional (see IF conditions). To avoid this, we now use CALL MISSING in each iteration of the data step and explicitly reset these variables to missing.

The RETAIN statement doesn't specify variable lengths and not even variable types. Therefore, the dollar signs (and potentially length specifications like $40) in the definitions of character arrays VC and A are mandatory.

It is one of the strengths of the SAS hash object that it accepts arbitrary combinations of character and numeric variables as key or data items. Hence, no need for code changes regarding hash object H.

The assignment statements in the last part of the data step have become a bit simpler thanks to the split arrays, but with your real data you will have more of them and the array indices will differ if you have more than one group of variables per type. Numeric variable YEARS requires the numeric counterpart of COALESCEC, i.e. COALESCE.

Further remarks:

As you see in the definition of ARRAY VC, all character variables would have the same length. This might be unfortunate (a waste of disk space) with your real data if you have, say, several very short variables (such as Y/N flags), but also long variables for names or addresses. In this case you may want to split the array further and define, e.g., VshortC as $1 and VlongC as $40.
With still more programming effort one could make the program fully data-driven: The number of ID variables (here: 4) and the names, types and lengths of the "data variables" (here: Name, Year) could be retrieved from DICTIONARY.COLUMNS and used dynamically in order to avoid hardcoding.

Merging multiple variables with a common dataset

Re: Merging multiple variables with a common dataset

Re: Merging multiple variables with a common dataset

Re: Merging multiple variables with a common dataset

Re: Merging multiple variables with a common dataset

Re: Merging multiple variables with a common dataset

Re: Merging multiple variables with a common dataset

Re: Merging multiple variables with a common dataset

Re: Merging multiple variables with a common dataset

Re: Merging multiple variables with a common dataset

Classroom Training Available!