Solved: Re: Help with merging

inuman · Posted 07-07-2013 05:54 PM

Hi!

I have 8 different data set, 4 quarter data set from individual level and 4 quarter data set from household level. I would like to combine all these data set into one. Previously I'm using this code:

*For Individual

data work.i07;

set hw.i071(in=m071) hw.i072(in=m072) hw.i073(in=m073) hw.i074(in=m074);

if m071=1 then qtr=1;

else if m072=1 then qtr=2;

else if m073=1 then qtr=3;

else if m074=1 then qtr=4;

run;

*For Household

data work.f07;

set hw.F071(in=f071) hw.F072(in=f072) hw.F073(in=f073) hw.F074(in=f074);

if f071=1 then qtr=1;

else if f072=1 then qtr=2;

else if f073=1 then qtr=3;

else if f074=1 then qtr=4;\

run;

*Merging both

data hw.all07;

merge work.f07

work.i07p;

by CU_ID;

run;

That code works, but I wonder if there is any simple code to do it. Perhaps using single data step instead using multiple step. Thank you.

Scott_Mitchell · Posted 07-07-2013 11:36 PM

Hi inuman,

It is important to understand that simpler code doesn't necessarily mean better code. You need to measure how efficient the code is to run and also take into account how easy the code is to interpret by other users. I have created some code that is very efficient from a keystroke perspective, but was either inefficient to run when bench marked against other methods or unreadable by my colleagues.

I don't really understand why you are creating the variable QTR twice. If you are merging the datasets with QTR in both, then the "F" datasets QTR variable results will be overwritten with the results from the QTR variable in the "I" dataset once merged.

Keeping this in mind you could do the following:

Dual Set Merge.

DATA FDATA (INDEX = (CU_ID));

SET HW.F071 HW.F072 HW.F073 HW.F074;

RUN;

DATA WORK.I07;

SET HW.I071 (IN = M071) HW.I072 (IN = M072) HW.I073 (IN = M073) HW.I074 (IN = M074);

IF M071=1 THEN QTR=1;

ELSE IF M072=1 THEN QTR=2;

ELSE IF M073=1 THEN QTR=3;

ELSE IF M074=1 THEN QTR=4;

SET FDATA KEY = CU_ID;

RUN;

As suggested by KSharp you could use INDSNAME, however this was only implemented in version 9.2.

SQL Merge;

PROC SQL;

CREATE TABLE WORK.I07V2 AS

SELECT * FROM

(SELECT *,

1 AS QTR FROM HW.I071

UNION ALL

SELECT * ,

2 AS QTR FROM HW.I072

UNION ALL

SELECT * ,

3 AS QTR FROM HW.I073

UNION ALL

SELECT * ,

4 AS QTR FROM HW.I074) AS I,

(SELECT * FROM HW.F071

UNION ALL

SELECT * FROM HW.F072

UNION ALL

SELECT * FROM HW.F073

UNION ALL

SELECT * FROM HW.F074) AS F

WHERE I.CU_ID = F.CU_ID

;

QUIT;

Be careful when using SQL for merges as a Cartesian product is produced and then trimmed to meet the conditions of the where statement, which can be quite inefficient. SAS datasteps are highly optimized and should be used in most instances, however SQL certainly has it's uses and shouldn't be entirely over looked. Once again it comes down to benchmarking and testing to ensure that you have the most efficient code for any given situation.

I have made a few assumptions about the structure of your data in the above example, but the basics are here for you to build upon. You could also perform a hash lookup, but I find these confusing for other users to read, therefore I use them sparingly and only when entirely necessary from a performance standpoint - usually when you have a very small dataset you want to merge against a very large one.

I hope this is useful. If not, please send a sample of the data you are using and I will attempt to help further.

Regards,

Scott

View solution in original post

Ksharp · Posted 07-07-2013 09:12 PM

The one thing I can think is using INDSNAME= option.

data work.i07;

set hw.i0: indsname=dsn;

name=scan(dsn,-1,'.');

Scott_Mitchell · Posted 07-07-2013 11:36 PM

Hi inuman,

It is important to understand that simpler code doesn't necessarily mean better code. You need to measure how efficient the code is to run and also take into account how easy the code is to interpret by other users. I have created some code that is very efficient from a keystroke perspective, but was either inefficient to run when bench marked against other methods or unreadable by my colleagues.

I don't really understand why you are creating the variable QTR twice. If you are merging the datasets with QTR in both, then the "F" datasets QTR variable results will be overwritten with the results from the QTR variable in the "I" dataset once merged.

Keeping this in mind you could do the following:

Dual Set Merge.

DATA FDATA (INDEX = (CU_ID));

SET HW.F071 HW.F072 HW.F073 HW.F074;

RUN;

DATA WORK.I07;

SET HW.I071 (IN = M071) HW.I072 (IN = M072) HW.I073 (IN = M073) HW.I074 (IN = M074);

IF M071=1 THEN QTR=1;

ELSE IF M072=1 THEN QTR=2;

ELSE IF M073=1 THEN QTR=3;

ELSE IF M074=1 THEN QTR=4;

SET FDATA KEY = CU_ID;

RUN;

As suggested by KSharp you could use INDSNAME, however this was only implemented in version 9.2.

SQL Merge;

PROC SQL;

CREATE TABLE WORK.I07V2 AS

SELECT * FROM

(SELECT *,

1 AS QTR FROM HW.I071

UNION ALL

SELECT * ,

2 AS QTR FROM HW.I072

UNION ALL

SELECT * ,

3 AS QTR FROM HW.I073

UNION ALL

SELECT * ,

4 AS QTR FROM HW.I074) AS I,

(SELECT * FROM HW.F071

UNION ALL

SELECT * FROM HW.F072

UNION ALL

SELECT * FROM HW.F073

UNION ALL

SELECT * FROM HW.F074) AS F

WHERE I.CU_ID = F.CU_ID

;

QUIT;

Be careful when using SQL for merges as a Cartesian product is produced and then trimmed to meet the conditions of the where statement, which can be quite inefficient. SAS datasteps are highly optimized and should be used in most instances, however SQL certainly has it's uses and shouldn't be entirely over looked. Once again it comes down to benchmarking and testing to ensure that you have the most efficient code for any given situation.

I have made a few assumptions about the structure of your data in the above example, but the basics are here for you to build upon. You could also perform a hash lookup, but I find these confusing for other users to read, therefore I use them sparingly and only when entirely necessary from a performance standpoint - usually when you have a very small dataset you want to merge against a very large one.

I hope this is useful. If not, please send a sample of the data you are using and I will attempt to help further.

Regards,

Scott

inuman · Posted 07-08-2013 12:37 AM

Thank you so much for both of your answer (Ksharp and Scott_Mitchell). I'll try your suggestion and see which one is the most efficient way. I'll be back here if I find any unresolved issues. Thank you once again.

Best,

Wisnu

Help with merging

Re: Help with merging

Re: Help with merging

Re: Help with merging

Re: Help with merging

Registration is open

SAS Training: Just a Click Away