Hi,
I have 2 datasets which I would like to merge into a wide dataset.
Have tried with merging and proc sql but not working properly.
Exampes of the 2 datasets
data part_1;
input id_2 i_id i_date i_prev i_time id_3 id_i g_type$ g_procedure g_date;
informat i_date g_date ddmmyy10. ;
format i_date g_date ddmmyy10.;
datalines;
12345 62233 15.09.2003 99 30 12344 62233 PM 4 15-09-2003
12345 78341 01-09-2012 99 30 78341 62233 PM 4 01-09-2012
12345 95732 15-09-2015 99 30 95732 62233 PM 4 15-09-2015
22222 57321 04-07-1998 0 20 22222 57321 PM 8 04-07-1998
22222 97325 08-03-2001 0 18 22222 97325 PM 1 08-03-2001
;
run;
data part_2;
input id_4 i_id2 id_l la ma_2$ mn_2$ status$6. i_id;
datalines;
12345 62233 5481 2 ME sprint exp 62233
12345 62233 5482 2 ME sprint exp 62233
12345 78341 8921 2 ME optis exp 78341
12345 95732 4841 2 ME optis active 95732
22222 57321 7777 3 GA Tip exp 57321
22222 97325 3901 3 GA Tip active 97325
;
run;
Want the lines to be matched by i_id and then by id (id_2 and id_4) is equal. For each id (_2 and 4) I only want one line at last. The connecting parameter from datasets part_1 and part_2 is i_id.
However as seen in part_2 , same i_id can have 2 observations - when these are converted to wide dataset, I would like to numerate these.
So the want dataset should look like:
12345 62233 15.09.2003 99 30 12344 62233 PM 4 15-09-2003 12345 78341 01-09-2012 99 30 78341 62233 PM 4 01-09-2012 12345 95732 15-09-2015 99 30 95732 62233 PM 4 15-09-2015 12345 62233 5481 2 ME sprint exp 62233 12345 62233 5482 2 ME sprint exp 62233 12345 78341 8921 2 ME optis exp 78341 12345 95732 4841 2 ME optis active 95732
22222 57321 04-07-1998 0 20 22222 57321 PM 8 04-07-1998 22222 97325 08-03-2001 0 18 22222 97325 PM 1 08-03-2001 22222 57321 7777 3 GA Tip exp 57321 22222 97325 3901 3 GA Tip active 97325
So all data from data_1 and data_2 will now be on 2 lines as it only represents 2 patients (id 12345 and 22222)
I tried these:
proc sort data=part_1; by i_id; run;
proc sort data=part_2; by i_id; run;
data wide;
merge part_1 part_2;
by i_id;
run;
proc sql;
create table want as
select *
,a.*
,b.*
from part_1 as a
full join i_id as b
on a.part1 = b.part2;
quit;
However, it didnt work.
Hope anyone can help
However, it didnt work.
Don't make us re-do the work you already did. Tell us what happened. Explain. What didn't work? If there is an error in the log, please show use the ENTIRE log, every single line, every single character, with nothing removed. If the output is wrong, show us what you got and explain why it is wrong.
I tried using merge datastep as i_id is linking the two datasteps. When I have used merge before it has only been for single observations. In data part_2 however one record_id has same i_id with 2 different observations (12345 62233).
I might think merge step is not suitable for this as my output is:
. 233 . . . . . . . 12345 62233 5481 2 ME sprint exp 62
. 233 . . . . . . . 12345 62233 5482 2 ME sprint exp 62
. 321 . . . . . . . 22222 57321 7777 3 GA Tip exp 57
. 341 . . . . . . . 12345 78341 8921 2 ME optis exp 78
22222 57321 04/07/1998 0 20 22222 57321 PM 8 04/07/1998 . . . .
12345 62233 15/09/2003 99 30 12344 62233 PM 4 15/09/2003 . . . .
12345 78341 01/09/2012 99 30 78341 62233 PM 4 01/09/2012 . . . .
12345 95732 15/09/2015 99 30 95732 62233 PM 4 15/09/2015 12345 95732 4841 2 ME optis active
22222 97325 08/03/2001 0 18 22222 97325 PM 1 08/03/2001 22222 97325 3901 3 GA Tip active
I expected 2 lines and this merging code gave the exact same numbers of lines.
When I go through my log, seems no errors and SAS is doing what its told to do. I think Im have misinterpreted what I can use the merge step for. Log:
The SAS System
1 ;*';*";*/;quit;run;
2 OPTIONS PAGENO=MIN;
3 %LET _CLIENTTASKLABEL='Program (2)';
4 %LET _CLIENTPROCESSFLOWNAME='Process Flow';
5 %LET _CLIENTPROJECTPATH='';
6 %LET _CLIENTPROJECTPATHHOST='';
7 %LET _CLIENTPROJECTNAME='';
8 %LET _SASPROGRAMFILE='';
9 %LET _SASPROGRAMFILEHOST='';
10
11 ODS _ALL_ CLOSE;
12 OPTIONS DEV=PNG;
13 GOPTIONS XPIXELS=0 YPIXELS=0;
14 FILENAME EGSR TEMP;
15 ODS tagsets.sasreport13(ID=EGSR) FILE=EGSR
16 STYLE=HTMLBlue
17 STYLESHEET=(URL="file:///C:/Program%20Files/SASHomeGrid/SASEnterpriseGuide/7.1/Styles/HTMLBlue.css")
18 NOGTITLE
19 NOGFOOTNOTE
20 GPATH=&sasworklocation
21 ENCODING=UTF8
22 options(rolap="on")
23 ;
NOTE: Writing TAGSETS.SASREPORT13(EGSR) Body file: EGSR
24
25 GOPTIONS ACCESSIBLE;
26
27 *Tried;
28 proc sort data=part_1; by i_id; run;
NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
29 proc sort data=part_2; by i_id; run;
NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds
30
31 data wide;
32 merge part_1 part_2;
33 by i_id;
34 run;
NOTE: There were 5 observations read from the data set WORK.PART_1.
NOTE: There were 6 observations read from the data set WORK.PART_2.
NOTE: The data set WORK.WIDE has 9 observations and 17 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
The SAS System
35
36 GOPTIONS NOACCESSIBLE;
37 %LET _CLIENTTASKLABEL=;
38 %LET _CLIENTPROCESSFLOWNAME=;
39 %LET _CLIENTPROJECTPATH=;
40 %LET _CLIENTPROJECTPATHHOST=;
41 %LET _CLIENTPROJECTNAME=;
42 %LET _SASPROGRAMFILE=;
43 %LET _SASPROGRAMFILEHOST=;
44
45 ;*';*";*/;quit;run;
46 ODS _ALL_ CLOSE;
47
48
49 QUIT; RUN;
50
This type of problem can be easily diagnosed by looking at your SAS data sets with your own eyes. Here is data set part_2.
You have not read the different variables into part_2 properly, and so you get unexpected results. If you fix this, you will get the expected results.
Ah, so I fixed this by deleting the number after $ for status which species the length of characters.
The new being:
data part_1;
input id_2 i_id i_date i_prev i_time id_3 id_i g_type$ g_procedure g_date;
informat i_date g_date ddmmyy10. ;
format i_date g_date ddmmyy10.;
datalines;
12345 62233 15.09.2003 99 30 12344 62233 PM 4 15-09-2003
12345 78341 01-09-2012 99 30 78341 62233 PM 4 01-09-2012
12345 95732 15-09-2015 99 30 95732 62233 PM 4 15-09-2015
22222 57321 04-07-1998 0 20 22222 57321 PM 8 04-07-1998
22222 97325 08-03-2001 0 18 22222 97325 PM 1 08-03-2001
;
run;
data part_2;
input id_4 i_id2 id_l la ma_2$ mn_2$ status i_id;
datalines;
12345 62233 5481 2 ME sprint exp 62233
12345 62233 5482 2 ME sprint exp 62233
12345 78341 8921 2 ME optis exp 78341
12345 95732 4841 2 ME optis active 95732
22222 57321 7777 3 GA Tip exp 57321
22222 97325 3901 3 GA Tip active 97325
;
run;
Then tried merging again around i_id, expecting now only 2 lines as output . One for id 12345 and one for 22222.
It sadly gives me 7 lines
22222 57321 04/07/1998 0 20 22222 57321 PM 8 04/07/1998 22222 57321 7777 3 GA Tip .
12345 62233 15/09/2003 99 30 12344 62233 PM 4 15/09/2003 12345 62233 5481 2 ME sprint .
12345 62233 15/09/2003 99 30 12344 62233 PM 4 15/09/2003 12345 62233 5482 2 ME sprint .
12345 78341 01/09/2012 99 30 78341 62233 PM 4 01/09/2012 12345 78341 8921 2 ME optis .
12345 95732 15/09/2015 99 30 95732 62233 PM 4 15/09/2015 12345 95732 4841 2 ME optis .
22222 97325 08/03/2001 0 18 22222 97325 PM 1 08/03/2001 22222 97325 3901 3 GA Tip .
Is it because I shall make a step with new columns and rename them as goes for repeating events - like 1, 2, 3 and so on ?
Here is the fix for part_2
input id_4 i_id2 id_l la ma_2 $ mn_2 $ status :$6. i_id;
Note the colon informat modifier. From the SAS Documentation:
For a character variable, this format modifier reads the value from the next non-blank column until the pointer reaches the next blank column, the defined length of the variable, or the end of the data line, whichever comes first.
Thanks the colon informat modifier helped to not continuing reading .
What are the key variables you want to merge on? Is it just I_ID?
What do you want to happen for I_ID = 62,233 that appears twice in PART_2?
A long as only one dataset is contributing more than observation to any BY group you can merge using data step.
proc sort data=part_1; by i_id; run;
proc sort data=part_2; by i_id; run;
data want;
merge part_1 part_2;
by i_id;
run;
proc print data=want;
by i_id;
id i_id;
run;
Results:
i_id id_2 i_date i_prev i_time id_3 id_i g_type g_procedure g_date id_4 i_id2 id_l la ma_2 mn_2 status 57321 22222 04/07/1998 0 20 22222 57321 PM 8 04/07/1998 22222 57321 7777 3 GA Tip exp 62233 12345 15/09/2003 99 30 12344 62233 PM 4 15/09/2003 12345 62233 5481 2 ME sprint exp 12345 15/09/2003 99 30 12344 62233 PM 4 15/09/2003 12345 62233 5482 2 ME sprint exp 78341 12345 01/09/2012 99 30 78341 62233 PM 4 01/09/2012 12345 78341 8921 2 ME optis exp 95732 12345 15/09/2015 99 30 95732 62233 PM 4 15/09/2015 12345 95732 4841 2 ME optis active 97325 22222 08/03/2001 0 18 22222 97325 PM 1 08/03/2001 22222 97325 3901 3 GA Tip active
Note: Your posted data step code for PART_2 is not going to work right becuase you included an informat in the INPUT statement without use the : modifier to make sure the line is parsed using LIST MODE.
data part_2;
input id_4 i_id2 id_l la ma_2 $ mn_2 $ status :$6. i_id;
datalines;
12345 62233 5481 2 ME sprint exp 62233
12345 62233 5482 2 ME sprint exp 62233
12345 78341 8921 2 ME optis exp 78341
12345 95732 4841 2 ME optis active 95732
22222 57321 7777 3 GA Tip exp 57321
22222 97325 3901 3 GA Tip active 97325
;
Would like to merge on I_ID.
Whats called id_2 in data part_1 is equal to id_4 in part_2. However I dont need double of these. Thought I could delete on of the columns after successful merging.
The essential for me is to have all data corresponding to each id_2/4 on ONE row. Thats what I meant about the wide dataset.
For 12345 (id_2) I have 3 observations in data part_1 and in data part_2 I have 4 observations for this particular id.
I planned to do this in two steps:
1) merge everything by i_id
2) change the long to wide dataset (i_id 62233) has one row in data part_1 and 2 rows in data part_2. I would like to have them on same row, the second observation eventuel with the suffix: _2 to differ between nr 1 and 2 observaion for that particular i_id.
Now, Im not sure how I get
62233 12345 15/09/2003 99 30 12344 62233 PM 4 15/09/2003 12345 62233 5481 2 ME sprint exp
12345 15/09/2003 99 30 12344 62233 PM 4 15/09/2003 12345 62233 5482 2 ME sprint exp
on one line same line.
At last I should end up with two rows corresponding only to two record_ids with all their information on same line like this:
12345 62233 15.09.2003 99 30 12344 62233 PM 4 15-09-2003 12345 78341 01-09-2012 99 30 78341 62233 PM 4 01-09-2012 12345 95732 15-09-2015 99 30 95732 62233 PM 4 15-09-2015 12345 62233 5481 2 ME sprint exp 62233 12345 62233 5482 2 ME sprint exp 62233 12345 78341 8921 2 ME optis exp 78341 12345 95732 4841 2 ME optis active 95732
22222 57321 04-07-1998 0 20 22222 57321 PM 8 04-07-1998 22222 97325 08-03-2001 0 18 22222 97325 PM 1 08-03-2001 22222 57321 7777 3 GA Tip exp 57321 22222 97325 3901 3 GA Tip active 97325
@PaigeMiller were also nice to mention the semicolon in my input statement. I changed it and that worked
This merging step now worked:
data wide;
merge part_1 part_2;
by i_id;
run;
Gave me this output:
Which data- og proc statement can I use to get the output collected?
Hope you guys can help my
Your subject line mentions LONG to WIDE format, but your question seems to be about either MERGE statement or possibly proper use of informats with the iNPUT statement.
What is your question about LONG to WIDE?
Can you describe the input and output structures?
Are you trying to convert multiple observations into a single one? Why ? What do you plan to do with it that requires a single observation?
Yea, I guess I hoped to solve my question by divided in
1, merging step
2, making wide step.
My question about long to wide is - as I have multiple events (actually up to 20 for each i_id), how can I convert it to wide ?
I was thinking proc transpose, but using it the simple way:
proc transpose data=want out=answer;
run;
I get:
_NAME_ COL1 COL2 COL3 COL4 COL5 COL6
id_2 22222 12345 12345 12345 12345 22222
i_id 57321 62233 62233 78341 95732 97325
i_date 14064 15963 15963 19237 20346 15042
i_prev 0 99 99 99 99 0
i_time 20 30 30 30 30 18
id_3 22222 12344 12344 78341 95732 22222
id_i 57321 62233 62233 62233 62233 97325
g_procedure 8 4 4 4 4 1
g_date 14064 15963 15963 19237 20346 15042
id_4 22222 12345 12345 12345 12345 22222
i_id2 57321 62233 62233 78341 95732 97325
id_l 7777 5481 5482 8921 4841 3901
la 3 2 2 2 2 3
which is not exactly what I want... Mayby I need some by statement?
As I have different variable names, seems I cant just use "var" statement 😕
You need to explain more about what your data represents to get a better answer.
In general the reason you have multiple observations is you have repeated measures for some variable. For example you might measure heart rate multiple times.
data have;
input id hr ;
cards;
1 75
1 70
2 56
2 50
3 80
3 79
;
In general leaving the data in this form is easier for analysis.
If you want to transpose the data from long to wide you can use PROC TRANSPOSE.
proc transpose data=have out=want(drop=_name_) prefix=hr ;
by id;
var hr;
run;
Which yields a dataset like this:
data want;
input id hr1 hr2;
cards;
1 75 70
2 56 50
3 80 79
;
But that works best with simple examples like the one above where there is only one variable that is measured multiple times.
yes, I see how the transposing is working with only one repeated variable. But I have a lot which makes it difficult for me to get right
In my case, I have for each patient :
- multiple operations
- mutilpe types of operations.
So I want outcome to be in this format:
Patient_1 Operation_1 Operation_type_info_1 Operation_2 Operation_type_info_2 Operation_3 Operation_type_info_3_part1 Operation_type_info_3_part2
Patient_2 Operation_1 Operation_type_info_1 Operation_2 Operation_type_info_2
So data from id_2 -- g_date represents operations (dataset part_1).
And data from id_4 -- i_id (dataset part_2)
The id_2 and id_4 is the same number as this represent each patient in different dataset (part_1 and part_2). i_id represents operation code beeing the only operator that connect data from operation and type of operation.
Sorted by the i_date I need eldeste operation to be first in the line and newest operation to be last in the line (already proc sorted i_date in my early step)
I showed the expected output earlier with only 2 lines including all information for each patient gathered in one line.
The reason I should get it on one line is to do analysis like time between operation, characterics about percentage of operation type for first, second,... operation. Hope this give a better understanding of my data ?
Such statistics can be better derived from the long layout. Wide layout is needed only for regression analysis (where you would spread out the parameters like frequency of operation type or time between operation that you derive from the long dataset).
Join us for SAS Innovate 2025, our biggest and most exciting global event of the year, in Orlando, FL, from May 6-9.
Lock in the best rate now before the price increases on April 1.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.