Solved: Big Merge Help

iiibbb · Posted 03-13-2012 09:52 AM

I normally don't have trouble with merges, but I am usually doing it with simpler datasets.

I need to merge two datasets (DAILY_PAR_MRG and IPARF_MRG) to create 5,640 unique entries based on the PAR_periods. The DAILY_PAR_MRG file is data from a meteorlogical station. Depending on which measurement period (PAR_PERIOD) the day falls in, a different factor (IPARf) needs to be applied to the variables DAILY_PAR and DAILY_MJ.

I thought this would be staight-forward merge, but the successful dataset will have 5,640 lines; each CLONE (N=15) and REP (N=4) will have 94 values of DAILY_PAR and DAILY_MJ for the dates from June-29 to Sept-30 with the IPARf that corresponds to the PAR_PERIOD for that date.

When I do a merge, I only get 600 lines. I'm not sure what I'm missing. Is this something for SQL? I've never used that before.

proc sort data=IPARf_mrg;

by PAR_Period;

run;

proc sort data=Daily_PAR_mrg;

by PAR_Period;

run;

data cPAR;

mergeDaily_PAR_mrg IPARf_mrg;

by PAR_Period;

proc print;

run;

Thank you.

I don't know why the above is different than the simple case below.

data one; input id $ fruit $12.; datalines; a apple a apricot b banana c coconut c cherry c crabapple ; data two; input id $ color $; datalines; a amber b brown b black c cocoa c cream ; data both; merge one two; by id; run; proc print data=both; run;

iiibbb · Posted 03-13-2012 01:17 PM

I think I've done it in SQL. Found a good tutorial at the UCLA website.

Thanks for the looks.

proc sort data=work.IPARf_mrg;

by PAR_Period;

run;

proc sort data=work.Daily_PAR_mrg;

by PAR_Period;

run;

proc sql;

create table work.cPAR_sql as

select A.PAR_Period, A.Clone, A.Rep, A.IPARf,

B.LUE_Period, B.Date, B.Daily_PAR, B.Daily_MJ

From IPARf_mrg as A, Daily_PAR_mrg as B

Where A.PAR_Period=B.PAR_Period;

quit;

proc sort data=work.cPAR_sql;

by rep clone date;

run;

proc print data=work.cPAR_sql;

run;

View solution in original post

HB · Posted 03-13-2012 11:32 AM

This part here:

data cPAR;

mergeDaily_PAR_mrg IPARf_mrg;

by PAR_Period;

proc print;

run;

is just a typo and is really:

data cPAR;

merge Daily_PAR_mrg (in=a) IPARf_mrg (in=b);

by PAR_Period;

if a;

run;

proc print;

run;

right?

iiibbb · Posted 03-13-2012 12:44 PM

Sorry, that just generates the same 600 line dataset I got with my statement.

Thanks

iiibbb · Posted 03-13-2012 01:17 PM

I think I've done it in SQL. Found a good tutorial at the UCLA website.

Thanks for the looks.

proc sort data=work.IPARf_mrg;

by PAR_Period;

run;

proc sort data=work.Daily_PAR_mrg;

by PAR_Period;

run;

proc sql;

create table work.cPAR_sql as

select A.PAR_Period, A.Clone, A.Rep, A.IPARf,

B.LUE_Period, B.Date, B.Daily_PAR, B.Daily_MJ

From IPARf_mrg as A, Daily_PAR_mrg as B

Where A.PAR_Period=B.PAR_Period;

quit;

proc sort data=work.cPAR_sql;

by rep clone date;

run;

proc print data=work.cPAR_sql;

run;

AndersS · Posted 03-19-2012 05:50 PM

The IF-statement is NOT quite correct. Se below:!

data cPAR;

merge Daily_PAR_mrg (in=a) IPARf_mrg (in=b);

by PAR_Period;

if a or b;

run;

------------------------------

data both; merge one two; by id; run;

This is the same as:
data both; merge one(in=in_one) two (in= in_two); by id;

if in_one=1 or in_two=1 then output; /* if in_one or in_two; */ run;

Anders Sköllermo (Skollermo in English)

Astounding · Posted 03-13-2012 06:57 PM

The nature of the difference is due to the many-to-many matching . MERGE handles that differently than SQL. Try running your final program with ONE, TWO, and BOTH and you'll see that the results are not necessarily what you expected especially when ID=C.

Good luck.

AndersS · Posted 03-19-2012 07:23 PM

It would be good if you could explain more what you want to do: e.g. when PAR_Period= 2 you have a lot of values in DAILY_PAR_MRG and a lot of values in PARF_MRG.

You can combine these by using SQL - then for each PAR_Period value you get #(DAILY_PAR_MRG) times #(PARF_MRG).

Or - you can use merge. Then for each PAR_Period value you get MAX(#(DAILY_PAR_MRG), #(PARF_MRG)) - which is smaller than the value above.

Your text "Dependingon which measurement period (PAR_PERIOD) the day falls in, a different factor(IPARf) needs to be applied to the variables " indicates that some information is missing in your description.

I think you need more by-variables.
Br Anders

Anders Sköllermo (Skollermo in English)

Ksharp · Posted 03-19-2012 10:32 PM

MERGE is not suited for your situation. but SQL cartesian product .

SAS Training: Just a Click Away