BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
inuman
Calcite | Level 5

Hi!

I have 8 different data set, 4 quarter data set from individual level and 4 quarter data set from household level. I would like to combine all these data set into one. Previously I'm using this code:

*For Individual

data work.i07;

  set hw.i071(in=m071) hw.i072(in=m072) hw.i073(in=m073) hw.i074(in=m074);

  if m071=1 then qtr=1;

  else if m072=1 then qtr=2;

  else if m073=1 then qtr=3;

  else if m074=1 then qtr=4;

run;

*For Household

data work.f07;

  set hw.F071(in=f071) hw.F072(in=f072) hw.F073(in=f073) hw.F074(in=f074);

  if f071=1 then qtr=1;

  else if f072=1 then qtr=2;

  else if f073=1 then qtr=3;

  else if f074=1 then qtr=4;\

run;

*Merging both

data hw.all07;

  merge work.f07

work.i07p;

by CU_ID;

run;

That code works, but I wonder if there is any simple code to do it. Perhaps using single data step instead using multiple step. Thank you.

1 ACCEPTED SOLUTION

Accepted Solutions
Scott_Mitchell
Quartz | Level 8

Hi inuman,

It is important to understand that simpler code doesn't necessarily mean better code.  You need to measure how efficient the code is to run and also take into account how easy the code is to interpret by other users.  I have created some code that is very efficient from a keystroke perspective, but was either inefficient to run when bench marked against other methods or unreadable by my colleagues.

I don't really understand why you are creating the variable QTR twice.  If you are merging the datasets with QTR in both, then the "F" datasets QTR variable results will be overwritten with the results from the QTR variable in the "I" dataset once merged.


Keeping this in mind you could do the following:

Dual Set Merge.

DATA FDATA (INDEX = (CU_ID));

  SET HW.F071 HW.F072  HW.F073  HW.F074;

RUN;

DATA WORK.I07;

  SET HW.I071 (IN = M071) HW.I072 (IN = M072) HW.I073 (IN = M073) HW.I074 (IN = M074);

       IF M071=1 THEN QTR=1;

  ELSE IF M072=1 THEN QTR=2;

  ELSE IF M073=1 THEN QTR=3;

  ELSE IF M074=1 THEN QTR=4;

  SET  FDATA KEY = CU_ID;

RUN;

As suggested by KSharp you could use INDSNAME, however this was only implemented in version 9.2.

SQL Merge;

PROC SQL;

CREATE TABLE WORK.I07V2 AS

SELECT * FROM

(SELECT *,

        1 AS QTR FROM HW.I071

UNION ALL

SELECT * ,

        2 AS QTR FROM HW.I072

UNION ALL

SELECT * ,

        3 AS QTR FROM HW.I073

UNION ALL

SELECT * ,

        4 AS QTR FROM HW.I074) AS I,

(SELECT * FROM HW.F071

UNION ALL

SELECT * FROM HW.F072

UNION ALL

SELECT * FROM HW.F073

UNION ALL

SELECT * FROM HW.F074) AS F

WHERE I.CU_ID = F.CU_ID

;

QUIT;

Be careful when using SQL for merges as a Cartesian product is produced and then trimmed to meet the conditions of the where statement, which can be quite inefficient.  SAS datasteps are highly optimized and should be used in most instances, however SQL certainly has it's uses and shouldn't be entirely over looked.  Once again it comes down to benchmarking and testing to ensure that you have the most efficient code for any given situation.

I have made a few assumptions about the structure of your data in the above example, but the basics are here for you to build upon.  You could also perform a hash lookup, but I find these confusing for other users to read, therefore I use them sparingly and only when entirely necessary from a performance standpoint - usually when you have a very small dataset you want to merge against a very large one.

I hope this is useful.  If not, please send a sample of the data you are using and I will attempt to help further.

Regards,

Scott

View solution in original post

3 REPLIES 3
Ksharp
Super User

The one thing I can think is using INDSNAME= option.

data work.i07;

  set hw.i0: indsname=dsn;

name=scan(dsn,-1,'.');

Scott_Mitchell
Quartz | Level 8

Hi inuman,

It is important to understand that simpler code doesn't necessarily mean better code.  You need to measure how efficient the code is to run and also take into account how easy the code is to interpret by other users.  I have created some code that is very efficient from a keystroke perspective, but was either inefficient to run when bench marked against other methods or unreadable by my colleagues.

I don't really understand why you are creating the variable QTR twice.  If you are merging the datasets with QTR in both, then the "F" datasets QTR variable results will be overwritten with the results from the QTR variable in the "I" dataset once merged.


Keeping this in mind you could do the following:

Dual Set Merge.

DATA FDATA (INDEX = (CU_ID));

  SET HW.F071 HW.F072  HW.F073  HW.F074;

RUN;

DATA WORK.I07;

  SET HW.I071 (IN = M071) HW.I072 (IN = M072) HW.I073 (IN = M073) HW.I074 (IN = M074);

       IF M071=1 THEN QTR=1;

  ELSE IF M072=1 THEN QTR=2;

  ELSE IF M073=1 THEN QTR=3;

  ELSE IF M074=1 THEN QTR=4;

  SET  FDATA KEY = CU_ID;

RUN;

As suggested by KSharp you could use INDSNAME, however this was only implemented in version 9.2.

SQL Merge;

PROC SQL;

CREATE TABLE WORK.I07V2 AS

SELECT * FROM

(SELECT *,

        1 AS QTR FROM HW.I071

UNION ALL

SELECT * ,

        2 AS QTR FROM HW.I072

UNION ALL

SELECT * ,

        3 AS QTR FROM HW.I073

UNION ALL

SELECT * ,

        4 AS QTR FROM HW.I074) AS I,

(SELECT * FROM HW.F071

UNION ALL

SELECT * FROM HW.F072

UNION ALL

SELECT * FROM HW.F073

UNION ALL

SELECT * FROM HW.F074) AS F

WHERE I.CU_ID = F.CU_ID

;

QUIT;

Be careful when using SQL for merges as a Cartesian product is produced and then trimmed to meet the conditions of the where statement, which can be quite inefficient.  SAS datasteps are highly optimized and should be used in most instances, however SQL certainly has it's uses and shouldn't be entirely over looked.  Once again it comes down to benchmarking and testing to ensure that you have the most efficient code for any given situation.

I have made a few assumptions about the structure of your data in the above example, but the basics are here for you to build upon.  You could also perform a hash lookup, but I find these confusing for other users to read, therefore I use them sparingly and only when entirely necessary from a performance standpoint - usually when you have a very small dataset you want to merge against a very large one.

I hope this is useful.  If not, please send a sample of the data you are using and I will attempt to help further.

Regards,

Scott

inuman
Calcite | Level 5

Thank you so much for both of your answer (Ksharp and Scott_Mitchell). I'll try your suggestion and see which one is the most efficient way. I'll be back here if I find any unresolved issues. Thank you once again.

Best,

Wisnu

sas-innovate-2024.png

Join us for SAS Innovate April 16-19 at the Aria in Las Vegas. Bring the team and save big with our group pricing for a limited time only.

Pre-conference courses and tutorials are filling up fast and are always a sellout. Register today to reserve your seat.

 

Register now!

What is Bayesian Analysis?

Learn the difference between classical and Bayesian statistical approaches and see a few PROC examples to perform Bayesian analysis in this video.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 3 replies
  • 905 views
  • 3 likes
  • 3 in conversation