BookmarkSubscribeRSS Feed
☑ This topic is solved. Need further help from the community? Please sign in and ask a new question.
Dave13
Fluorite | Level 6

Hello. I've been trying to merge several large data sets unsuccessfully. They can be rather large (3 million rows of data). I've been trying to sort the data on the common variables and merge without success. I'm only posting small examples of the data sets. Feel free to point me to reading/manuals to help resolve by coding

Sample code:

PROC SORT DATA=RATIO2;
BY SWISSPROT SEX DOSE PND;
RUN;
PROC SORT DATA=PDMEANS2;
BY SWISSPROT SEX DOSE PND;
RUN;
DATA SINGLE_PD;
MERGE PDMEANS2 RATIO2;
BY SWISSPROT SEX DOSE PND;
RUN;
PROC PRINT DATA=SINGLE_PD;

 

The 1st data set looks like this:

ObsSEXDOSEPNDSWISSPROT_TYPE__FREQ_MEAN_ARMED_ARN
1F02A0A096MJN4060.328230.322586
2F08A0A096MJN4060.563120.575546
3F015A0A096MJN4060.671770.653256
4F3.782A0A096MJN4060.324720.296376
5F3.788A0A096MJN4060.476610.461896
6F3.7815A0A096MJN4060.671480.692926
7M02A0A096MJN4060.326420.342876
8M08A0A096MJN4060.520170.526326
9M015A0A096MJN4060.621650.553166
10M3.782A0A096MJN4060.32390.332036
11M3.788A0A096MJN4060.503820.510086
12M3.7815A0A096MJN4060.832330.862766
13F02A0A096MJZ0061.184461.208366

 

The second data set looks like:

ObsFILESWISSPROTAR1SEXPNDDOSEPVALUE1FDR1
1F1A0A096MJN40.323F2000
2F1A0A096MJN40.575F808.38E-072.35E-06
3F1A0A096MJN40.652F1500.0020070.007018
Rowsnotincluded inpost     
97F1A0A096MJZ01.214F200.985161

 

What I'm trying to end up with:

ObsSEXDOSEPNDSWISSPROT_TYPE__FREQ_MEAN_ARMED_ARNAR1PVALUE1FDR1
1F02A0A096MJN4060.328230.3225860.32300
2F08A0A096MJN4060.563120.5755460.5758.38E-072.35E-06
3F015A0A096MJN4060.671770.6532560.6520.0020070.007018
4F3.782A0A096MJN4060.324720.296376   
5F3.788A0A096MJN4060.476610.461896   
6F3.7815A0A096MJN4060.671480.692926   
7M02A0A096MJN4060.326420.342876   
8M08A0A096MJN4060.520170.526326   
9M015A0A096MJN4060.621650.553166   
10M3.782A0A096MJN4060.32390.332036   
11M3.788A0A096MJN4060.503820.510086   
12M3.7815A0A096MJN4060.832330.862766   
13F02A0A096MJZ0061.184461.2083661.2140.985161
14F08A0A096MJZ0061.279731.231496   

 

Thaks in advance for any suggestions.

1 ACCEPTED SOLUTION

Accepted Solutions
Patrick
Opal | Level 21

@Dave13 A SQL like below will tell you if you've got any such leading blanks in character variables that you're using for the merge.

data RATIO2;
  infile datalines truncover dsd;
  input FILE $ SWISSPROT :$char20. AR1 SEX :$char6. PND DOSE PVALUE1 FDR1;
  datalines;
F1,A0A096MJN4,0.323,F,2,0,0,0
F1,  A0A096MJN4,0.575,F,8,0,8.38E-07,2.35E-06
F1,A0A096MJN4,0.652, F,15,0,0.002007,0.007018,,,,,
F1,A0A096MJZ0,1.214, F,2,0,0.98516,1
;

proc sql;
  select 
    sum(
        case
          when length(swissprot) ne length(left(swissprot)) then 1
          else 0
          end
        ) as n_leading_blk_SWISSPROT,
    sum(
        case
          when length(SEX) ne length(left(SEX)) then 1
          else 0
          end
        ) as n_leading_blk_SEX
  from ratio2
  ;
quit;

 

View solution in original post

8 REPLIES 8
Patrick
Opal | Level 21

From what I understand the code you shared already creates your desired result - except for obs 14 for which there is no matching data in your source tables.

What am I missing?

data PDMEANS2;
  infile datalines truncover;
  input SEX $ DOSE PND SWISSPROT :$20. _TYPE_ _FREQ_ MEAN_AR MED_AR N;
  datalines;
F 0 2 A0A096MJN4 0 6 0.32823 0.32258 6
F 0 8 A0A096MJN4 0 6 0.56312 0.57554 6
F 0 15 A0A096MJN4 0 6 0.67177 0.65325 6
F 3.78 2 A0A096MJN4 0 6 0.32472 0.29637 6
F 3.78 8 A0A096MJN4 0 6 0.47661 0.46189 6
F 3.78 15 A0A096MJN4 0 6 0.67148 0.69292 6
M 0 2 A0A096MJN4 0 6 0.32642 0.34287 6
M 0 8 A0A096MJN4 0 6 0.52017 0.52632 6
M 0 15 A0A096MJN4 0 6 0.62165 0.55316 6
M 3.78 2 A0A096MJN4 0 6 0.3239 0.33203 6
M 3.78 8 A0A096MJN4 0 6 0.50382 0.51008 6
M 3.78 15 A0A096MJN4 0 6 0.83233 0.86276 6
F 0 2 A0A096MJZ0 0 6 1.18446 1.20836 6
;

data RATIO2;
  infile datalines truncover;
  input FILE $ SWISSPROT :$20. AR1 SEX $ PND DOSE PVALUE1 FDR1;
  datalines;
F1 A0A096MJN4 0.323 F 2 0 0 0
F1 A0A096MJN4 0.575 F 8 0 8.38E-07 2.35E-06
F1 A0A096MJN4 0.652 F 15 0 0.002007 0.007018     
F1 A0A096MJZ0 1.214 F 2 0 0.98516 1
;

PROC SORT DATA=RATIO2;
  BY SWISSPROT SEX DOSE PND;
RUN;

PROC SORT DATA=PDMEANS2;
  BY SWISSPROT SEX DOSE PND;
RUN;

DATA SINGLE_PD;
  MERGE PDMEANS2 RATIO2;
  BY SWISSPROT SEX DOSE PND;
RUN;

PROC PRINT DATA=SINGLE_PD;
run;

Patrick_0-1697431938764.png

 

Dave13
Fluorite | Level 6
HI Patrick:
The missing lines were left out for simplicity. They should be filled in with data. I manually cut/paste into an Excel spreadsheet to illustrate what I wanted. I'm including a small example of what I am currently getting. I'm not very proficient at SQL, but have a feeling I may need to improve to solve this issue. Thanks for looking into the question.
Obs FILE SWISSPROT AR1 SEX PND DOSE PVALUE1 FDR1 _TYPE_ _FREQ_ MEAN_AR MED_AR N

1 F1 A0A096MJN4 0.323 F 2 0.00 0.00000 0.00000 0 6 0.32823 0.32258 6
2 A0A096MJN4 . F 8 0.00 . . 0 6 0.56312 0.57554 6
3 A0A096MJN4 . F 15 0.00 . . 0 6 0.67177 0.65325 6
4 A0A096MJN4 . F 2 3.78 . . 0 6 0.32472 0.29637 6
5 A0A096MJN4 . F 8 3.78 . . 0 6 0.47661 0.46189 6
6 A0A096MJN4 . F 15 3.78 . . 0 6 0.67148 0.69292 6
7 A0A096MJN4 . M 2 0.00 . . 0 6 0.32642 0.34287 6
8 A0A096MJN4 . M 8 0.00 . . 0 6 0.52017 0.52632 6
9 A0A096MJN4 . M 15 0.00 . . 0 6 0.62165 0.55316 6
10 A0A096MJN4 . M 2 3.78 . . 0 6 0.32390 0.33203 6
11 A0A096MJN4 . M 8 3.78 . . 0 6 0.50382 0.51008 6
12 A0A096MJN4 . M 15 3.78 . . 0 6 0.83233 0.86276 6
13 F1 A0A096MJZ0 1.214 F 2 0.00 0.98516 1.00000 0 6 1.18446 1.20836 6


ballardw
Super User

@Dave13 wrote:
HI Patrick:
The missing lines were left out for simplicity. They should be filled in with data. I manually cut/paste into an Excel spreadsheet to illustrate what I wanted. I'm including a small example of what I am currently getting. I'm not very proficient at SQL, but have a feeling I may need to improve to solve this issue. Thanks for looking into the question.



I am a little confused. "They should be filled in with data." From where? If the other data set does not have matching data where is the filling data supposed to come from?

 

The only reason I would think of that Proc SQL might be needed is if you have multiple observations in both data sets with the exact same values of the BY statement, and even then you likely need some post processing to deal with the multiple output observations likely to be in the result.

Dave13
Fluorite | Level 6

Yes, the input dataset has an AR1 PVALUE1 FDR1  for each combination. Sorry for the confusing data set examples. I've been trying to keep the example simple, and apparently things are confusing. I get a correct merge for the 1st line of each Swissprot, (sex=F, PND=2, Dose=0 condition) then missing values for the other combinations of sex, pnd, and dose. However, data exists for all of the other combinations of sex, pnd, and dose. I'll continue to work on it this afternoon myself. Thanks for your time and comments.

ballardw
Super User

If you think you matching data but MERGE doesn't use the values then you may have issues with the actual values. If a variable is character you may need to remove leading spaces in values. Many ways of looking at data, such as proc print or report, by default will not show leading spaces in the results window because of procedure defaults:

data example;
   var='    1';output;
   var='   1'; output;
   var=' 1'; output;
   var='1'; output;
run;

proc freq data=example;
run;

Proc print data=example; 
run;

The data defines a variable with up to 5 blanks preceding the numeral. Note that even  though Proc freq counts the  values as different they display as apparently the same.

Proc print just left justifies the values.

 

If you are matching on a numeric value sometime you need to consider rounding to force values to be identical. Also the format assigned may hide the difference in values but the match in merge can tell they are not numerically equal. So a likely candidate for this problem related to Merge is your DOSE variable. What you showed may well have non-displayed digits because of a format like 5.2 which will round 3.776 to 3.78 (or 3.775 or 3.7745 or ....). Best4. might do the same as well.

 

Instructions here: https://communities.sas.com/t5/SAS-Communities-Library/How-to-create-a-data-step-version-of-your-dat... will show how to turn an existing SAS data set into data step code that can be pasted into a forum code box using the </> icon or attached as text to show exactly what you have and that we can test code against.

 

That may help with showing your data in the form of a data step that we can further examine for issues related to merge.

 

You can subset your original data sets to a small example. You may not need all of the variables if you have many more more than your previous examples used.

 

 

Patrick
Opal | Level 21

@Dave13 A SQL like below will tell you if you've got any such leading blanks in character variables that you're using for the merge.

data RATIO2;
  infile datalines truncover dsd;
  input FILE $ SWISSPROT :$char20. AR1 SEX :$char6. PND DOSE PVALUE1 FDR1;
  datalines;
F1,A0A096MJN4,0.323,F,2,0,0,0
F1,  A0A096MJN4,0.575,F,8,0,8.38E-07,2.35E-06
F1,A0A096MJN4,0.652, F,15,0,0.002007,0.007018,,,,,
F1,A0A096MJZ0,1.214, F,2,0,0.98516,1
;

proc sql;
  select 
    sum(
        case
          when length(swissprot) ne length(left(swissprot)) then 1
          else 0
          end
        ) as n_leading_blk_SWISSPROT,
    sum(
        case
          when length(SEX) ne length(left(SEX)) then 1
          else 0
          end
        ) as n_leading_blk_SEX
  from ratio2
  ;
quit;

 

Dave13
Fluorite | Level 6

@Patrick Thanks for the SQL code. I am unfortunately not very practiced with SQL, but will have to practice up. This example is very useful and will solve current and future issues. Thanks for all the help.

Patrick
Opal | Level 21

@Dave13 wrote:

@Patrick Thanks for the SQL code. I am unfortunately not very practiced with SQL, but will have to practice up. This example is very useful and will solve current and future issues. Thanks for all the help.


You can also use a SAS datastep for the same but getting some SQL expertise is certainly worth the effort. 

data check(keep=n_leading_blk_SWISSPROT n_leading_blk_SEX);
  set ratio2 end=last;
  retain n_leading_blk_SWISSPROT n_leading_blk_SEX 0;
  if length(swissprot) ne length(left(swissprot)) then n_leading_blk_SWISSPROT+1;
  if length(SEX) ne length(left(SEX)) then n_leading_blk_SEX+1;
  if last then output;
run;

proc print data=check;
run;

sas-innovate-wordmark-2025-midnight.png

Register Today!

Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9. Sign up by March 14 for just $795.


Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 8 replies
  • 2164 views
  • 2 likes
  • 3 in conversation