Solved: Merging/Interleaving data sets with multiple rows using SAS 9.4

Dave13 · Posted 10-15-2023 11:25 PM

Hello. I've been trying to merge several large data sets unsuccessfully. They can be rather large (3 million rows of data). I've been trying to sort the data on the common variables and merge without success. I'm only posting small examples of the data sets. Feel free to point me to reading/manuals to help resolve by coding

Sample code:

PROC SORT DATA=RATIO2;
BY SWISSPROT SEX DOSE PND;
RUN;
PROC SORT DATA=PDMEANS2;
BY SWISSPROT SEX DOSE PND;
RUN;
DATA SINGLE_PD;
MERGE PDMEANS2 RATIO2;
BY SWISSPROT SEX DOSE PND;
RUN;
PROC PRINT DATA=SINGLE_PD;

The 1st data set looks like this:

Obs	SEX	DOSE	PND	SWISSPROT	_TYPE_	_FREQ_	MEAN_AR	MED_AR	N
1	F	0	2	A0A096MJN4	0	6	0.32823	0.32258	6
2	F	0	8	A0A096MJN4	0	6	0.56312	0.57554	6
3	F	0	15	A0A096MJN4	0	6	0.67177	0.65325	6
4	F	3.78	2	A0A096MJN4	0	6	0.32472	0.29637	6
5	F	3.78	8	A0A096MJN4	0	6	0.47661	0.46189	6
6	F	3.78	15	A0A096MJN4	0	6	0.67148	0.69292	6
7	M	0	2	A0A096MJN4	0	6	0.32642	0.34287	6
8	M	0	8	A0A096MJN4	0	6	0.52017	0.52632	6
9	M	0	15	A0A096MJN4	0	6	0.62165	0.55316	6
10	M	3.78	2	A0A096MJN4	0	6	0.3239	0.33203	6
11	M	3.78	8	A0A096MJN4	0	6	0.50382	0.51008	6
12	M	3.78	15	A0A096MJN4	0	6	0.83233	0.86276	6
13	F	0	2	A0A096MJZ0	0	6	1.18446	1.20836	6

The second data set looks like:

Obs	FILE	SWISSPROT	AR1	SEX	PND	DOSE	PVALUE1	FDR1
1	F1	A0A096MJN4	0.323	F	2	0	0	0
2	F1	A0A096MJN4	0.575	F	8	0	8.38E-07	2.35E-06
3	F1	A0A096MJN4	0.652	F	15	0	0.002007	0.007018
Rows	not	included in	post
97	F1	A0A096MJZ0	1.214	F	2	0	0.98516	1

What I'm trying to end up with:

Obs	SEX	DOSE	PND	SWISSPROT	_TYPE_	_FREQ_	MEAN_AR	MED_AR	N	AR1	PVALUE1	FDR1
1	F	0	2	A0A096MJN4	0	6	0.32823	0.32258	6	0.323	0	0
2	F	0	8	A0A096MJN4	0	6	0.56312	0.57554	6	0.575	8.38E-07	2.35E-06
3	F	0	15	A0A096MJN4	0	6	0.67177	0.65325	6	0.652	0.002007	0.007018
4	F	3.78	2	A0A096MJN4	0	6	0.32472	0.29637	6
5	F	3.78	8	A0A096MJN4	0	6	0.47661	0.46189	6
6	F	3.78	15	A0A096MJN4	0	6	0.67148	0.69292	6
7	M	0	2	A0A096MJN4	0	6	0.32642	0.34287	6
8	M	0	8	A0A096MJN4	0	6	0.52017	0.52632	6
9	M	0	15	A0A096MJN4	0	6	0.62165	0.55316	6
10	M	3.78	2	A0A096MJN4	0	6	0.3239	0.33203	6
11	M	3.78	8	A0A096MJN4	0	6	0.50382	0.51008	6
12	M	3.78	15	A0A096MJN4	0	6	0.83233	0.86276	6
13	F	0	2	A0A096MJZ0	0	6	1.18446	1.20836	6	1.214	0.98516	1
14	F	0	8	A0A096MJZ0	0	6	1.27973	1.23149	6

Thaks in advance for any suggestions.

Patrick · Posted 10-16-2023 11:24 PM

@Dave13 A SQL like below will tell you if you've got any such leading blanks in character variables that you're using for the merge.

data RATIO2;
  infile datalines truncover dsd;
  input FILE $ SWISSPROT :$char20. AR1 SEX :$char6. PND DOSE PVALUE1 FDR1;
  datalines;
F1,A0A096MJN4,0.323,F,2,0,0,0
F1,  A0A096MJN4,0.575,F,8,0,8.38E-07,2.35E-06
F1,A0A096MJN4,0.652, F,15,0,0.002007,0.007018,,,,,
F1,A0A096MJZ0,1.214, F,2,0,0.98516,1
;

proc sql;
  select 
    sum(
        case
          when length(swissprot) ne length(left(swissprot)) then 1
          else 0
          end
        ) as n_leading_blk_SWISSPROT,
    sum(
        case
          when length(SEX) ne length(left(SEX)) then 1
          else 0
          end
        ) as n_leading_blk_SEX
  from ratio2
  ;
quit;

View solution in original post

Patrick · Posted 10-16-2023 12:53 AM

From what I understand the code you shared already creates your desired result - except for obs 14 for which there is no matching data in your source tables.

What am I missing?

data PDMEANS2;
  infile datalines truncover;
  input SEX $ DOSE PND SWISSPROT :$20. _TYPE_ _FREQ_ MEAN_AR MED_AR N;
  datalines;
F 0 2 A0A096MJN4 0 6 0.32823 0.32258 6
F 0 8 A0A096MJN4 0 6 0.56312 0.57554 6
F 0 15 A0A096MJN4 0 6 0.67177 0.65325 6
F 3.78 2 A0A096MJN4 0 6 0.32472 0.29637 6
F 3.78 8 A0A096MJN4 0 6 0.47661 0.46189 6
F 3.78 15 A0A096MJN4 0 6 0.67148 0.69292 6
M 0 2 A0A096MJN4 0 6 0.32642 0.34287 6
M 0 8 A0A096MJN4 0 6 0.52017 0.52632 6
M 0 15 A0A096MJN4 0 6 0.62165 0.55316 6
M 3.78 2 A0A096MJN4 0 6 0.3239 0.33203 6
M 3.78 8 A0A096MJN4 0 6 0.50382 0.51008 6
M 3.78 15 A0A096MJN4 0 6 0.83233 0.86276 6
F 0 2 A0A096MJZ0 0 6 1.18446 1.20836 6
;

data RATIO2;
  infile datalines truncover;
  input FILE $ SWISSPROT :$20. AR1 SEX $ PND DOSE PVALUE1 FDR1;
  datalines;
F1 A0A096MJN4 0.323 F 2 0 0 0
F1 A0A096MJN4 0.575 F 8 0 8.38E-07 2.35E-06
F1 A0A096MJN4 0.652 F 15 0 0.002007 0.007018     
F1 A0A096MJZ0 1.214 F 2 0 0.98516 1
;

PROC SORT DATA=RATIO2;
  BY SWISSPROT SEX DOSE PND;
RUN;

PROC SORT DATA=PDMEANS2;
  BY SWISSPROT SEX DOSE PND;
RUN;

DATA SINGLE_PD;
  MERGE PDMEANS2 RATIO2;
  BY SWISSPROT SEX DOSE PND;
RUN;

PROC PRINT DATA=SINGLE_PD;
run;

Dave13 · Posted 10-16-2023 01:07 PM

HI Patrick:
The missing lines were left out for simplicity. They should be filled in with data. I manually cut/paste into an Excel spreadsheet to illustrate what I wanted. I'm including a small example of what I am currently getting. I'm not very proficient at SQL, but have a feeling I may need to improve to solve this issue. Thanks for looking into the question.
Obs FILE SWISSPROT AR1 SEX PND DOSE PVALUE1 FDR1 _TYPE_ _FREQ_ MEAN_AR MED_AR N

1 F1 A0A096MJN4 0.323 F 2 0.00 0.00000 0.00000 0 6 0.32823 0.32258 6
2 A0A096MJN4 . F 8 0.00 . . 0 6 0.56312 0.57554 6
3 A0A096MJN4 . F 15 0.00 . . 0 6 0.67177 0.65325 6
4 A0A096MJN4 . F 2 3.78 . . 0 6 0.32472 0.29637 6
5 A0A096MJN4 . F 8 3.78 . . 0 6 0.47661 0.46189 6
6 A0A096MJN4 . F 15 3.78 . . 0 6 0.67148 0.69292 6
7 A0A096MJN4 . M 2 0.00 . . 0 6 0.32642 0.34287 6
8 A0A096MJN4 . M 8 0.00 . . 0 6 0.52017 0.52632 6
9 A0A096MJN4 . M 15 0.00 . . 0 6 0.62165 0.55316 6
10 A0A096MJN4 . M 2 3.78 . . 0 6 0.32390 0.33203 6
11 A0A096MJN4 . M 8 3.78 . . 0 6 0.50382 0.51008 6
12 A0A096MJN4 . M 15 3.78 . . 0 6 0.83233 0.86276 6
13 F1 A0A096MJZ0 1.214 F 2 0.00 0.98516 1.00000 0 6 1.18446 1.20836 6

ballardw · Posted 10-16-2023 01:52 PM

@Dave13 wrote:
HI Patrick:
The missing lines were left out for simplicity. They should be filled in with data. I manually cut/paste into an Excel spreadsheet to illustrate what I wanted. I'm including a small example of what I am currently getting. I'm not very proficient at SQL, but have a feeling I may need to improve to solve this issue. Thanks for looking into the question.

I am a little confused. "They should be filled in with data." From where? If the other data set does not have matching data where is the filling data supposed to come from?

The only reason I would think of that Proc SQL might be needed is if you have multiple observations in both data sets with the exact same values of the BY statement, and even then you likely need some post processing to deal with the multiple output observations likely to be in the result.

Dave13 · Posted 10-16-2023 02:29 PM

Yes, the input dataset has an AR1 PVALUE1 FDR1 for each combination. Sorry for the confusing data set examples. I've been trying to keep the example simple, and apparently things are confusing. I get a correct merge for the 1st line of each Swissprot, (sex=F, PND=2, Dose=0 condition) then missing values for the other combinations of sex, pnd, and dose. However, data exists for all of the other combinations of sex, pnd, and dose. I'll continue to work on it this afternoon myself. Thanks for your time and comments.

ballardw · Posted 10-16-2023 04:35 PM

If you think you matching data but MERGE doesn't use the values then you may have issues with the actual values. If a variable is character you may need to remove leading spaces in values. Many ways of looking at data, such as proc print or report, by default will not show leading spaces in the results window because of procedure defaults:

data example;
   var='    1';output;
   var='   1'; output;
   var=' 1'; output;
   var='1'; output;
run;

proc freq data=example;
run;

Proc print data=example; 
run;

The data defines a variable with up to 5 blanks preceding the numeral. Note that even though Proc freq counts the values as different they display as apparently the same.

Proc print just left justifies the values.

If you are matching on a numeric value sometime you need to consider rounding to force values to be identical. Also the format assigned may hide the difference in values but the match in merge can tell they are not numerically equal. So a likely candidate for this problem related to Merge is your DOSE variable. What you showed may well have non-displayed digits because of a format like 5.2 which will round 3.776 to 3.78 (or 3.775 or 3.7745 or ....). Best4. might do the same as well.

Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the </> icon or attached as text to show exactly what you have and that we can test code against.

That may help with showing your data in the form of a data step that we can further examine for issues related to merge.

You can subset your original data sets to a small example. You may not need all of the variables if you have many more more than your previous examples used.

Patrick · Posted 10-16-2023 11:24 PM

@Dave13 A SQL like below will tell you if you've got any such leading blanks in character variables that you're using for the merge.

data RATIO2;
  infile datalines truncover dsd;
  input FILE $ SWISSPROT :$char20. AR1 SEX :$char6. PND DOSE PVALUE1 FDR1;
  datalines;
F1,A0A096MJN4,0.323,F,2,0,0,0
F1,  A0A096MJN4,0.575,F,8,0,8.38E-07,2.35E-06
F1,A0A096MJN4,0.652, F,15,0,0.002007,0.007018,,,,,
F1,A0A096MJZ0,1.214, F,2,0,0.98516,1
;

proc sql;
  select 
    sum(
        case
          when length(swissprot) ne length(left(swissprot)) then 1
          else 0
          end
        ) as n_leading_blk_SWISSPROT,
    sum(
        case
          when length(SEX) ne length(left(SEX)) then 1
          else 0
          end
        ) as n_leading_blk_SEX
  from ratio2
  ;
quit;

Dave13 · Posted 10-17-2023 08:56 AM

@Patrick Thanks for the SQL code. I am unfortunately not very practiced with SQL, but will have to practice up. This example is very useful and will solve current and future issues. Thanks for all the help.

Patrick · Posted 10-17-2023 09:21 PM

@Dave13 wrote:

@Patrick Thanks for the SQL code. I am unfortunately not very practiced with SQL, but will have to practice up. This example is very useful and will solve current and future issues. Thanks for all the help.

You can also use a SAS datastep for the same but getting some SQL expertise is certainly worth the effort.

data check(keep=n_leading_blk_SWISSPROT n_leading_blk_SEX);
  set ratio2 end=last;
  retain n_leading_blk_SWISSPROT n_leading_blk_SEX 0;
  if length(swissprot) ne length(left(swissprot)) then n_leading_blk_SWISSPROT+1;
  if length(SEX) ne length(left(SEX)) then n_leading_blk_SEX+1;
  if last then output;
run;

proc print data=check;
run;

Merging/Interleaving data sets with multiple rows using SAS 9.4

Re: Merging/Interleaving data sets with multiple rows using SAS 9.4

Re: Merging/Interleaving data sets with multiple rows using SAS 9.4

Re: Merging/Interleaving data sets with multiple rows using SAS 9.4

Re: Merging/Interleaving data sets with multiple rows using SAS 9.4

Re: Merging/Interleaving data sets with multiple rows using SAS 9.4

Re: Merging/Interleaving data sets with multiple rows using SAS 9.4

Re: Merging/Interleaving data sets with multiple rows using SAS 9.4

Re: Merging/Interleaving data sets with multiple rows using SAS 9.4

Re: Merging/Interleaving data sets with multiple rows using SAS 9.4

Registration is open

SAS Training: Just a Click Away