Hi!
I have 8 different data set, 4 quarter data set from individual level and 4 quarter data set from household level. I would like to combine all these data set into one. Previously I'm using this code:
*For Individual
data work.i07;
set hw.i071(in=m071) hw.i072(in=m072) hw.i073(in=m073) hw.i074(in=m074);
if m071=1 then qtr=1;
else if m072=1 then qtr=2;
else if m073=1 then qtr=3;
else if m074=1 then qtr=4;
run;
*For Household
data work.f07;
set hw.F071(in=f071) hw.F072(in=f072) hw.F073(in=f073) hw.F074(in=f074);
if f071=1 then qtr=1;
else if f072=1 then qtr=2;
else if f073=1 then qtr=3;
else if f074=1 then qtr=4;\
run;
*Merging both
data hw.all07;
merge work.f07
work.i07p;
by CU_ID;
run;
That code works, but I wonder if there is any simple code to do it. Perhaps using single data step instead using multiple step. Thank you.
Hi inuman,
It is important to understand that simpler code doesn't necessarily mean better code. You need to measure how efficient the code is to run and also take into account how easy the code is to interpret by other users. I have created some code that is very efficient from a keystroke perspective, but was either inefficient to run when bench marked against other methods or unreadable by my colleagues.
I don't really understand why you are creating the variable QTR twice. If you are merging the datasets with QTR in both, then the "F" datasets QTR variable results will be overwritten with the results from the QTR variable in the "I" dataset once merged.
Keeping this in mind you could do the following:
Dual Set Merge.
DATA FDATA (INDEX = (CU_ID));
SET HW.F071 HW.F072 HW.F073 HW.F074;
RUN;
DATA WORK.I07;
SET HW.I071 (IN = M071) HW.I072 (IN = M072) HW.I073 (IN = M073) HW.I074 (IN = M074);
IF M071=1 THEN QTR=1;
ELSE IF M072=1 THEN QTR=2;
ELSE IF M073=1 THEN QTR=3;
ELSE IF M074=1 THEN QTR=4;
SET FDATA KEY = CU_ID;
RUN;
As suggested by KSharp you could use INDSNAME, however this was only implemented in version 9.2.
SQL Merge;
PROC SQL;
CREATE TABLE WORK.I07V2 AS
SELECT * FROM
(SELECT *,
1 AS QTR FROM HW.I071
UNION ALL
SELECT * ,
2 AS QTR FROM HW.I072
UNION ALL
SELECT * ,
3 AS QTR FROM HW.I073
UNION ALL
SELECT * ,
4 AS QTR FROM HW.I074) AS I,
(SELECT * FROM HW.F071
UNION ALL
SELECT * FROM HW.F072
UNION ALL
SELECT * FROM HW.F073
UNION ALL
SELECT * FROM HW.F074) AS F
WHERE I.CU_ID = F.CU_ID
;
QUIT;
Be careful when using SQL for merges as a Cartesian product is produced and then trimmed to meet the conditions of the where statement, which can be quite inefficient. SAS datasteps are highly optimized and should be used in most instances, however SQL certainly has it's uses and shouldn't be entirely over looked. Once again it comes down to benchmarking and testing to ensure that you have the most efficient code for any given situation.
I have made a few assumptions about the structure of your data in the above example, but the basics are here for you to build upon. You could also perform a hash lookup, but I find these confusing for other users to read, therefore I use them sparingly and only when entirely necessary from a performance standpoint - usually when you have a very small dataset you want to merge against a very large one.
I hope this is useful. If not, please send a sample of the data you are using and I will attempt to help further.
Regards,
Scott
The one thing I can think is using INDSNAME= option.
data work.i07;
set hw.i0: indsname=dsn;
name=scan(dsn,-1,'.');
Hi inuman,
It is important to understand that simpler code doesn't necessarily mean better code. You need to measure how efficient the code is to run and also take into account how easy the code is to interpret by other users. I have created some code that is very efficient from a keystroke perspective, but was either inefficient to run when bench marked against other methods or unreadable by my colleagues.
I don't really understand why you are creating the variable QTR twice. If you are merging the datasets with QTR in both, then the "F" datasets QTR variable results will be overwritten with the results from the QTR variable in the "I" dataset once merged.
Keeping this in mind you could do the following:
Dual Set Merge.
DATA FDATA (INDEX = (CU_ID));
SET HW.F071 HW.F072 HW.F073 HW.F074;
RUN;
DATA WORK.I07;
SET HW.I071 (IN = M071) HW.I072 (IN = M072) HW.I073 (IN = M073) HW.I074 (IN = M074);
IF M071=1 THEN QTR=1;
ELSE IF M072=1 THEN QTR=2;
ELSE IF M073=1 THEN QTR=3;
ELSE IF M074=1 THEN QTR=4;
SET FDATA KEY = CU_ID;
RUN;
As suggested by KSharp you could use INDSNAME, however this was only implemented in version 9.2.
SQL Merge;
PROC SQL;
CREATE TABLE WORK.I07V2 AS
SELECT * FROM
(SELECT *,
1 AS QTR FROM HW.I071
UNION ALL
SELECT * ,
2 AS QTR FROM HW.I072
UNION ALL
SELECT * ,
3 AS QTR FROM HW.I073
UNION ALL
SELECT * ,
4 AS QTR FROM HW.I074) AS I,
(SELECT * FROM HW.F071
UNION ALL
SELECT * FROM HW.F072
UNION ALL
SELECT * FROM HW.F073
UNION ALL
SELECT * FROM HW.F074) AS F
WHERE I.CU_ID = F.CU_ID
;
QUIT;
Be careful when using SQL for merges as a Cartesian product is produced and then trimmed to meet the conditions of the where statement, which can be quite inefficient. SAS datasteps are highly optimized and should be used in most instances, however SQL certainly has it's uses and shouldn't be entirely over looked. Once again it comes down to benchmarking and testing to ensure that you have the most efficient code for any given situation.
I have made a few assumptions about the structure of your data in the above example, but the basics are here for you to build upon. You could also perform a hash lookup, but I find these confusing for other users to read, therefore I use them sparingly and only when entirely necessary from a performance standpoint - usually when you have a very small dataset you want to merge against a very large one.
I hope this is useful. If not, please send a sample of the data you are using and I will attempt to help further.
Regards,
Scott
Thank you so much for both of your answer (Ksharp and Scott_Mitchell). I'll try your suggestion and see which one is the most efficient way. I'll be back here if I find any unresolved issues. Thank you once again.
Best,
Wisnu
Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.
Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.
Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.
Find more tutorials on the SAS Users YouTube channel.