Hello. I've been trying to merge several large data sets unsuccessfully. They can be rather large (3 million rows of data). I've been trying to sort the data on the common variables and merge without success. I'm only posting small examples of the data sets. Feel free to point me to reading/manuals to help resolve by coding
Sample code:
PROC SORT DATA=RATIO2;
BY SWISSPROT SEX DOSE PND;
RUN;
PROC SORT DATA=PDMEANS2;
BY SWISSPROT SEX DOSE PND;
RUN;
DATA SINGLE_PD;
MERGE PDMEANS2 RATIO2;
BY SWISSPROT SEX DOSE PND;
RUN;
PROC PRINT DATA=SINGLE_PD;
The 1st data set looks like this:
Obs | SEX | DOSE | PND | SWISSPROT | _TYPE_ | _FREQ_ | MEAN_AR | MED_AR | N |
1 | F | 0 | 2 | A0A096MJN4 | 0 | 6 | 0.32823 | 0.32258 | 6 |
2 | F | 0 | 8 | A0A096MJN4 | 0 | 6 | 0.56312 | 0.57554 | 6 |
3 | F | 0 | 15 | A0A096MJN4 | 0 | 6 | 0.67177 | 0.65325 | 6 |
4 | F | 3.78 | 2 | A0A096MJN4 | 0 | 6 | 0.32472 | 0.29637 | 6 |
5 | F | 3.78 | 8 | A0A096MJN4 | 0 | 6 | 0.47661 | 0.46189 | 6 |
6 | F | 3.78 | 15 | A0A096MJN4 | 0 | 6 | 0.67148 | 0.69292 | 6 |
7 | M | 0 | 2 | A0A096MJN4 | 0 | 6 | 0.32642 | 0.34287 | 6 |
8 | M | 0 | 8 | A0A096MJN4 | 0 | 6 | 0.52017 | 0.52632 | 6 |
9 | M | 0 | 15 | A0A096MJN4 | 0 | 6 | 0.62165 | 0.55316 | 6 |
10 | M | 3.78 | 2 | A0A096MJN4 | 0 | 6 | 0.3239 | 0.33203 | 6 |
11 | M | 3.78 | 8 | A0A096MJN4 | 0 | 6 | 0.50382 | 0.51008 | 6 |
12 | M | 3.78 | 15 | A0A096MJN4 | 0 | 6 | 0.83233 | 0.86276 | 6 |
13 | F | 0 | 2 | A0A096MJZ0 | 0 | 6 | 1.18446 | 1.20836 | 6 |
The second data set looks like:
Obs | FILE | SWISSPROT | AR1 | SEX | PND | DOSE | PVALUE1 | FDR1 |
1 | F1 | A0A096MJN4 | 0.323 | F | 2 | 0 | 0 | 0 |
2 | F1 | A0A096MJN4 | 0.575 | F | 8 | 0 | 8.38E-07 | 2.35E-06 |
3 | F1 | A0A096MJN4 | 0.652 | F | 15 | 0 | 0.002007 | 0.007018 |
Rows | not | included in | post | |||||
97 | F1 | A0A096MJZ0 | 1.214 | F | 2 | 0 | 0.98516 | 1 |
What I'm trying to end up with:
Obs | SEX | DOSE | PND | SWISSPROT | _TYPE_ | _FREQ_ | MEAN_AR | MED_AR | N | AR1 | PVALUE1 | FDR1 |
1 | F | 0 | 2 | A0A096MJN4 | 0 | 6 | 0.32823 | 0.32258 | 6 | 0.323 | 0 | 0 |
2 | F | 0 | 8 | A0A096MJN4 | 0 | 6 | 0.56312 | 0.57554 | 6 | 0.575 | 8.38E-07 | 2.35E-06 |
3 | F | 0 | 15 | A0A096MJN4 | 0 | 6 | 0.67177 | 0.65325 | 6 | 0.652 | 0.002007 | 0.007018 |
4 | F | 3.78 | 2 | A0A096MJN4 | 0 | 6 | 0.32472 | 0.29637 | 6 | |||
5 | F | 3.78 | 8 | A0A096MJN4 | 0 | 6 | 0.47661 | 0.46189 | 6 | |||
6 | F | 3.78 | 15 | A0A096MJN4 | 0 | 6 | 0.67148 | 0.69292 | 6 | |||
7 | M | 0 | 2 | A0A096MJN4 | 0 | 6 | 0.32642 | 0.34287 | 6 | |||
8 | M | 0 | 8 | A0A096MJN4 | 0 | 6 | 0.52017 | 0.52632 | 6 | |||
9 | M | 0 | 15 | A0A096MJN4 | 0 | 6 | 0.62165 | 0.55316 | 6 | |||
10 | M | 3.78 | 2 | A0A096MJN4 | 0 | 6 | 0.3239 | 0.33203 | 6 | |||
11 | M | 3.78 | 8 | A0A096MJN4 | 0 | 6 | 0.50382 | 0.51008 | 6 | |||
12 | M | 3.78 | 15 | A0A096MJN4 | 0 | 6 | 0.83233 | 0.86276 | 6 | |||
13 | F | 0 | 2 | A0A096MJZ0 | 0 | 6 | 1.18446 | 1.20836 | 6 | 1.214 | 0.98516 | 1 |
14 | F | 0 | 8 | A0A096MJZ0 | 0 | 6 | 1.27973 | 1.23149 | 6 |
Thaks in advance for any suggestions.
@Dave13 A SQL like below will tell you if you've got any such leading blanks in character variables that you're using for the merge.
data RATIO2;
infile datalines truncover dsd;
input FILE $ SWISSPROT :$char20. AR1 SEX :$char6. PND DOSE PVALUE1 FDR1;
datalines;
F1,A0A096MJN4,0.323,F,2,0,0,0
F1, A0A096MJN4,0.575,F,8,0,8.38E-07,2.35E-06
F1,A0A096MJN4,0.652, F,15,0,0.002007,0.007018,,,,,
F1,A0A096MJZ0,1.214, F,2,0,0.98516,1
;
proc sql;
select
sum(
case
when length(swissprot) ne length(left(swissprot)) then 1
else 0
end
) as n_leading_blk_SWISSPROT,
sum(
case
when length(SEX) ne length(left(SEX)) then 1
else 0
end
) as n_leading_blk_SEX
from ratio2
;
quit;
From what I understand the code you shared already creates your desired result - except for obs 14 for which there is no matching data in your source tables.
What am I missing?
data PDMEANS2;
infile datalines truncover;
input SEX $ DOSE PND SWISSPROT :$20. _TYPE_ _FREQ_ MEAN_AR MED_AR N;
datalines;
F 0 2 A0A096MJN4 0 6 0.32823 0.32258 6
F 0 8 A0A096MJN4 0 6 0.56312 0.57554 6
F 0 15 A0A096MJN4 0 6 0.67177 0.65325 6
F 3.78 2 A0A096MJN4 0 6 0.32472 0.29637 6
F 3.78 8 A0A096MJN4 0 6 0.47661 0.46189 6
F 3.78 15 A0A096MJN4 0 6 0.67148 0.69292 6
M 0 2 A0A096MJN4 0 6 0.32642 0.34287 6
M 0 8 A0A096MJN4 0 6 0.52017 0.52632 6
M 0 15 A0A096MJN4 0 6 0.62165 0.55316 6
M 3.78 2 A0A096MJN4 0 6 0.3239 0.33203 6
M 3.78 8 A0A096MJN4 0 6 0.50382 0.51008 6
M 3.78 15 A0A096MJN4 0 6 0.83233 0.86276 6
F 0 2 A0A096MJZ0 0 6 1.18446 1.20836 6
;
data RATIO2;
infile datalines truncover;
input FILE $ SWISSPROT :$20. AR1 SEX $ PND DOSE PVALUE1 FDR1;
datalines;
F1 A0A096MJN4 0.323 F 2 0 0 0
F1 A0A096MJN4 0.575 F 8 0 8.38E-07 2.35E-06
F1 A0A096MJN4 0.652 F 15 0 0.002007 0.007018
F1 A0A096MJZ0 1.214 F 2 0 0.98516 1
;
PROC SORT DATA=RATIO2;
BY SWISSPROT SEX DOSE PND;
RUN;
PROC SORT DATA=PDMEANS2;
BY SWISSPROT SEX DOSE PND;
RUN;
DATA SINGLE_PD;
MERGE PDMEANS2 RATIO2;
BY SWISSPROT SEX DOSE PND;
RUN;
PROC PRINT DATA=SINGLE_PD;
run;
@Dave13 wrote:
HI Patrick:
The missing lines were left out for simplicity. They should be filled in with data. I manually cut/paste into an Excel spreadsheet to illustrate what I wanted. I'm including a small example of what I am currently getting. I'm not very proficient at SQL, but have a feeling I may need to improve to solve this issue. Thanks for looking into the question.
I am a little confused. "They should be filled in with data." From where? If the other data set does not have matching data where is the filling data supposed to come from?
The only reason I would think of that Proc SQL might be needed is if you have multiple observations in both data sets with the exact same values of the BY statement, and even then you likely need some post processing to deal with the multiple output observations likely to be in the result.
Yes, the input dataset has an AR1 PVALUE1 FDR1 for each combination. Sorry for the confusing data set examples. I've been trying to keep the example simple, and apparently things are confusing. I get a correct merge for the 1st line of each Swissprot, (sex=F, PND=2, Dose=0 condition) then missing values for the other combinations of sex, pnd, and dose. However, data exists for all of the other combinations of sex, pnd, and dose. I'll continue to work on it this afternoon myself. Thanks for your time and comments.
If you think you matching data but MERGE doesn't use the values then you may have issues with the actual values. If a variable is character you may need to remove leading spaces in values. Many ways of looking at data, such as proc print or report, by default will not show leading spaces in the results window because of procedure defaults:
data example; var=' 1';output; var=' 1'; output; var=' 1'; output; var='1'; output; run; proc freq data=example; run; Proc print data=example; run;
The data defines a variable with up to 5 blanks preceding the numeral. Note that even though Proc freq counts the values as different they display as apparently the same.
Proc print just left justifies the values.
If you are matching on a numeric value sometime you need to consider rounding to force values to be identical. Also the format assigned may hide the difference in values but the match in merge can tell they are not numerically equal. So a likely candidate for this problem related to Merge is your DOSE variable. What you showed may well have non-displayed digits because of a format like 5.2 which will round 3.776 to 3.78 (or 3.775 or 3.7745 or ....). Best4. might do the same as well.
Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the </> icon or attached as text to show exactly what you have and that we can test code against.
That may help with showing your data in the form of a data step that we can further examine for issues related to merge.
You can subset your original data sets to a small example. You may not need all of the variables if you have many more more than your previous examples used.
@Dave13 A SQL like below will tell you if you've got any such leading blanks in character variables that you're using for the merge.
data RATIO2;
infile datalines truncover dsd;
input FILE $ SWISSPROT :$char20. AR1 SEX :$char6. PND DOSE PVALUE1 FDR1;
datalines;
F1,A0A096MJN4,0.323,F,2,0,0,0
F1, A0A096MJN4,0.575,F,8,0,8.38E-07,2.35E-06
F1,A0A096MJN4,0.652, F,15,0,0.002007,0.007018,,,,,
F1,A0A096MJZ0,1.214, F,2,0,0.98516,1
;
proc sql;
select
sum(
case
when length(swissprot) ne length(left(swissprot)) then 1
else 0
end
) as n_leading_blk_SWISSPROT,
sum(
case
when length(SEX) ne length(left(SEX)) then 1
else 0
end
) as n_leading_blk_SEX
from ratio2
;
quit;
@Patrick Thanks for the SQL code. I am unfortunately not very practiced with SQL, but will have to practice up. This example is very useful and will solve current and future issues. Thanks for all the help.
@Dave13 wrote:
@Patrick Thanks for the SQL code. I am unfortunately not very practiced with SQL, but will have to practice up. This example is very useful and will solve current and future issues. Thanks for all the help.
You can also use a SAS datastep for the same but getting some SQL expertise is certainly worth the effort.
data check(keep=n_leading_blk_SWISSPROT n_leading_blk_SEX);
set ratio2 end=last;
retain n_leading_blk_SWISSPROT n_leading_blk_SEX 0;
if length(swissprot) ne length(left(swissprot)) then n_leading_blk_SWISSPROT+1;
if length(SEX) ne length(left(SEX)) then n_leading_blk_SEX+1;
if last then output;
run;
proc print data=check;
run;
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.