BookmarkSubscribeRSS Feed
Rahim_221
Calcite | Level 5

Hello,

I have to merge two data sets by subject_id and measure date. 

 

First data set is as follows, (data1)

Subject_id.                 Measure-date

ABC0001                    8/2017

ABC0002                     10/2020

ABC0005                      9/2017

 

Second data set is as follows; (data2)

Subject_id              Measuredate              height              weight           record_number

ABC0001                 5/2012                       176                    56                  500

ABC0001                 6/2016                       180                    70                   500

ABC0001                 8/2017                       182                    99                    500

ABC0002                 10/2020                      160                   55                    510

ABC0002                  11/2021                      170                    66                  510

ABC0005                   9/2017                        190                  80                   502

 

 

Desired data is as follows; (data want)

Subject_id              measure date              height           weight        record_number

ABC0001                   8/2017                       182                99                500

ABC0002                   10/2020                      160              55                  510

ABC0005                  9/2017                          190            80                  502

 

 

Problem is as follows:

The data that I get:

Subject_id              measure date              height           weight        record_number

ABC0001                   8/2017                       182                99                500

ABC0002                   10/2020                      .                        .                 .

ABC0005                  9/2017                          .                       .                  .

 

 

The dataset I am working on is ACTUALLY thousands of observations. When I merge the data sets, I get some observations merged as desired, while the majority of the other observations are appearing as missing data. I am not sure what the problem is.

 

The code I am using is:

Proc sort data=data1; by subject_id  measuredate;run;

proc sort data=data2; by  subject_id   measure date; run;

 

data want;

merge data1 (in=a)        data2 (in=b); by subject_id  measuredate; if b; run;\

 

 

 

 

 

8 REPLIES 8
SASKiwi
PROC Star

I strongly suspect that "measure date" is your problem. It appears to be a character variable and it looks like it is affected by leading blanks. To confirm what your variable types are run this:

proc contents data = data1;
run;

proc contents data = data2;
run;

If measure date is character, then it would be a very good idea to convert it to a numeric SAS date. That might require adding a day to the month and year you currently have like the first or last day of the month.

 

Rahim_221
Calcite | Level 5
Hello,
Thank you for attempting to answer my question. I would like to provide some additional discoveries on my end.
The measuredate column in dataset 1 is made of day, month, and year, while the same variable in dataset 2 is only made of a month and year. Therefore, in dataset 1, I created a new measuredate column (from the original measuredate col) that consists only of month and year and I did that to match the variable measuredate in dataset 2. Even tho the newly created col of only month and yr is visible as (09/2017, for example) it is still actually a day, month, and year. I used the below code to create a to component date from the three components date.

So, what I have now is a key merging column in dataset 2 that consist of a two component date, while its equivalent in dataset 1 is a three component date that appears as a two component date.
I hope I was able to properly represent this to you as I believe the problem lies there.
SASJedi
SAS Super FREQ

@Rahim_221 I don't see how the data you showed in the initial question could possibly be a three-component date. You show it as a single variable, and the values only display month and year:

First data set is as follows, (data1)

Subject_id.                 Measuredate

ABC0001                    8/2017

ABC0002                     10/2020

ABC0005                      9/2017

The only way that those values could contain a day portion is if they were numeric SAS dates being displayed with a format like mmyys10. For the other data set, you say Measuredate contains only month and year. For that to be true, it would have to be a character (text) value. And if both of those things are true, you can't perform a SAS DATA step merge based on variables of different types.  So it would be useful to provide code that will re-recreate the two data sets to help us better understand the issues you are experiencing. 

You could try using the data2datastep macro to produce the code required to reproduce your data. For example, this SAS program grabs the source code for the data2datastep macro from the internet and compiles the macro for you. Then, a macro call reads the cars dataset from the sashelp library and generates a program that will re-create 10 observations of the data in the work library. The program is saved to a file named make_cars_data.sas and in home directory (~/😞

/* Get and compile the macro code from GitHub */
filename getmacro url "https://raw.githubusercontent.com/SASJedi/sas-macros/master/data2datastep.sas";
%include getmacro;
filename getmacro clear;

/* Call the macro to do the work */
%DATA2DATASTEP(cars,sashelp,work,~/make_cars_data.sas,10)

If you just leave the last parameter blank, the whole dataset will be re-created. After running that program, the make_cars_data.sas looks like this:

data work.CARS(label='2004 Car Data');
  infile datalines dsd truncover;
  input Make:$13. Model:$40. Type:$8. Origin:$6. DriveTrain:$5. MSRP:DOLLAR8. Invoice:DOLLAR8. EngineSize:32. Cylinders:32. Horsepower:32. MPG_City:32. MPG_Highway:32. Weight:32. Wheelbase:32. Length:32.;
  format MSRP DOLLAR8. Invoice DOLLAR8.;
  label EngineSize="Engine Size (L)" MPG_City="MPG (City)" MPG_Highway="MPG (Highway)" Weight="Weight (LBS)" Wheelbase="Wheelbase (IN)" Length="Length (IN)";
datalines4;
Acura,MDX,SUV,Asia,All,"$36,945","$33,337",3.5,6,265,17,23,4451,106,189
Acura,RSX Type S 2dr,Sedan,Asia,Front,"$23,820","$21,761",2,4,200,24,31,2778,101,172
Acura,TSX 4dr,Sedan,Asia,Front,"$26,990","$24,647",2.4,4,200,22,29,3230,105,183
Acura,TL 4dr,Sedan,Asia,Front,"$33,195","$30,299",3.2,6,270,20,28,3575,108,186
Acura,3.5 RL 4dr,Sedan,Asia,Front,"$43,755","$39,014",3.5,6,225,18,24,3880,115,197
;;;;

If you do that for both of your data sets and then share the resulting code to reproduce them here, we can help resolve the problem more quickly. 

All the best,

Mark

 

Check out my Jedi SAS Tricks for SAS Users
Rahim_221
Calcite | Level 5
Thank you Mark for your reply. The variable measuredate in my data set is
actually a variable that I created for both my data sets to be able to
merge. It is in a format mmyys10.

The reason I did that, is because my data1 contains a date variable that
has only month and year, while date set 2 has a date variable in format
mmddyy10.
So I can’t merge two data sets by date variable if it’s not in identical
format in both data set, that’s why I created measuredate in mmyys10.

I now know the reason why the merge is incorrect, because the measuredate
variable is formated but doesn’t mean the day is not there anymore. It
means that it is only showing month and year but day is still there not
showing. So when I attempted to merge, the day is not matching in both data
sets.

Thanks
Kurt_Bremser
Super User

If both variables are actually SAS date values, align them to the first day of the month by using the INTNX function with the "b" modifier.

measure_date = intnx('month',measure_date,0,'b');
Tom
Super User Tom
Super User

It is only going to work if the variables are storing the same type of information.  

If they are numeric variables with date values then make sure they are aligned to the same date of the month.

measure_date = intnx('month',measure_date,0);

If they are character strings make sure they are consistently formatted.  Do the months have leading zeros or not?  Do the values have leading spaces or not?  It would probably be better to convert them into date values instead of leaving them as character strings.

measure_date_number = input(cats('1/',measure_date),ddmmyy10.);
format measure_date_number yymms7.;

If you leave them as charcter strings then standardize the style used.

measure_date = put(input(cats('1/',measure_date),ddmmyy10.),yymms7.);

Note that it is best to store dates as character strings in YMD order so that when sorted they will appear in chronological order.

mkeintz
PROC Star

You apparently want one record per id, containing measure_date from DATA1 (which only has 1 obs per id), and all other variables from the most recent obs in DATA2.

 

If so, assuming data1 is sorted by ID, and data2 is sorted by ID/measure_date, then:

 

data want;
  merge data1 data2 (drop=measure_date);
  by id;
  if last.id;
run;

 

 

 

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

Ready to join fellow brilliant minds for the SAS Hackathon?

Build your skills. Make connections. Enjoy creative freedom. Maybe change the world. Registration is now open through August 30th. Visit the SAS Hackathon homepage.

Register today!
Mastering the WHERE Clause in PROC SQL

SAS' Charu Shankar shares her PROC SQL expertise by showing you how to master the WHERE clause using real winter weather data.

Find more tutorials on the SAS Users YouTube channel.

Discussion stats
  • 8 replies
  • 1315 views
  • 0 likes
  • 6 in conversation