Re: Long to wide format with multiple events

rookie21 · Posted 12-20-2021 12:00 PM

Hi,

I have 2 datasets which I would like to merge into a wide dataset.

Have tried with merging and proc sql but not working properly.

Exampes of the 2 datasets

data part_1;
input id_2 i_id i_date i_prev i_time id_3 id_i g_type$ g_procedure g_date;
informat i_date g_date ddmmyy10. ;
format i_date g_date ddmmyy10.;
datalines;
12345 62233 15.09.2003 99 30 12344 62233 PM 4 15-09-2003
12345 78341 01-09-2012 99 30 78341 62233 PM 4 01-09-2012
12345 95732 15-09-2015 99 30 95732 62233 PM 4 15-09-2015
22222 57321 04-07-1998 0  20 22222 57321 PM 8 04-07-1998
22222 97325 08-03-2001 0  18 22222 97325 PM 1 08-03-2001
; 
run; 

data part_2; 
input id_4 i_id2 id_l la ma_2$ mn_2$ status$6. i_id; 
datalines;
12345 62233 5481 2 ME sprint exp 62233
12345 62233 5482 2 ME sprint exp 62233
12345 78341 8921 2 ME optis exp 78341
12345 95732 4841 2 ME optis active 95732
22222 57321 7777 3 GA Tip exp 57321
22222 97325 3901 3 GA Tip active 97325
; 
run;

Want the lines to be matched by i_id and then by id (id_2 and id_4) is equal. For each id (_2 and 4) I only want one line at last. The connecting parameter from datasets part_1 and part_2 is i_id.

However as seen in part_2 , same i_id can have 2 observations - when these are converted to wide dataset, I would like to numerate these.

So the want dataset should look like:

12345 62233 15.09.2003 99 30 12344 62233 PM 4 15-09-2003 12345 78341 01-09-2012 99 30 78341 62233 PM 4 01-09-2012 12345 95732 15-09-2015 99 30 95732 62233 PM 4 15-09-2015 12345 62233 5481 2 ME sprint exp 62233 12345 62233 5482 2 ME sprint exp 62233 12345 78341 8921 2 ME optis exp 78341 12345 95732 4841 2 ME optis active 95732
22222 57321 04-07-1998 0 20 22222 57321 PM 8 04-07-1998 22222 97325 08-03-2001 0 18 22222 97325 PM 1 08-03-2001 22222 57321 7777 3 GA Tip exp 57321 22222 97325 3901 3 GA Tip active 97325

So all data from data_1 and data_2 will now be on 2 lines as it only represents 2 patients (id 12345 and 22222)

I tried these:

proc sort data=part_1; by i_id; run;
proc sort data=part_2; by i_id; run;

data wide;
merge part_1 part_2;
by i_id;
run;

proc sql;
create table want as
select *
,a.*
,b.*
from part_1 as a
full join i_id as b
on a.part1 = b.part2;
quit;

However, it didnt work.

Hope anyone can help

PaigeMiller · Posted 12-20-2021 12:03 PM

However, it didnt work.

Don't make us re-do the work you already did. Tell us what happened. Explain. What didn't work? If there is an error in the log, please show use the ENTIRE log, every single line, every single character, with nothing removed. If the output is wrong, show us what you got and explain why it is wrong.

--
Paige Miller

rookie21 · Posted 12-20-2021 12:10 PM

I tried using merge datastep as i_id is linking the two datasteps. When I have used merge before it has only been for single observations. In data part_2 however one record_id has same i_id with 2 different observations (12345 62233).

I might think merge step is not suitable for this as my output is:

. 233 . . . . . . . 12345 62233 5481 2 ME sprint exp 62
. 233 . . . . . . . 12345 62233 5482 2 ME sprint exp 62
. 321 . . . . . . . 22222 57321 7777 3 GA Tip exp 57
. 341 . . . . . . . 12345 78341 8921 2 ME optis exp 78
22222 57321 04/07/1998 0 20 22222 57321 PM 8 04/07/1998 . . . .
12345 62233 15/09/2003 99 30 12344 62233 PM 4 15/09/2003 . . . .
12345 78341 01/09/2012 99 30 78341 62233 PM 4 01/09/2012 . . . .
12345 95732 15/09/2015 99 30 95732 62233 PM 4 15/09/2015 12345 95732 4841 2 ME optis active

22222 97325 08/03/2001 0 18 22222 97325 PM 1 08/03/2001 22222 97325 3901 3 GA Tip active

I expected 2 lines and this merging code gave the exact same numbers of lines.

When I go through my log, seems no errors and SAS is doing what its told to do. I think Im have misinterpreted what I can use the merge step for. Log:

The SAS System

1 ;*';*";*/;quit;run;
2 OPTIONS PAGENO=MIN;
3 %LET _CLIENTTASKLABEL='Program (2)';
4 %LET _CLIENTPROCESSFLOWNAME='Process Flow';
5 %LET _CLIENTPROJECTPATH='';
6 %LET _CLIENTPROJECTPATHHOST='';
7 %LET _CLIENTPROJECTNAME='';
8 %LET _SASPROGRAMFILE='';
9 %LET _SASPROGRAMFILEHOST='';
10
11 ODS _ALL_ CLOSE;
12 OPTIONS DEV=PNG;
13 GOPTIONS XPIXELS=0 YPIXELS=0;
14 FILENAME EGSR TEMP;
15 ODS tagsets.sasreport13(ID=EGSR) FILE=EGSR
16 STYLE=HTMLBlue
17 STYLESHEET=(URL="file:///C:/Program%20Files/SASHomeGrid/SASEnterpriseGuide/7.1/Styles/HTMLBlue.css")
18 NOGTITLE
19 NOGFOOTNOTE
20 GPATH=&sasworklocation
21 ENCODING=UTF8
22 options(rolap="on")
23 ;
NOTE: Writing TAGSETS.SASREPORT13(EGSR) Body file: EGSR
24
25 GOPTIONS ACCESSIBLE;
26
27 *Tried;

28 proc sort data=part_1; by i_id; run;

NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds

29 proc sort data=part_2; by i_id; run;

NOTE: Input data set is already sorted, no sorting done.
NOTE: PROCEDURE SORT used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds

30
31 data wide;
32 merge part_1 part_2;
33 by i_id;
34 run;

NOTE: There were 5 observations read from the data set WORK.PART_1.
NOTE: There were 6 observations read from the data set WORK.PART_2.
NOTE: The data set WORK.WIDE has 9 observations and 17 variables.
NOTE: DATA statement used (Total process time):
real time 0.00 seconds
cpu time 0.00 seconds
The SAS System

35
36 GOPTIONS NOACCESSIBLE;
37 %LET _CLIENTTASKLABEL=;
38 %LET _CLIENTPROCESSFLOWNAME=;
39 %LET _CLIENTPROJECTPATH=;
40 %LET _CLIENTPROJECTPATHHOST=;
41 %LET _CLIENTPROJECTNAME=;
42 %LET _SASPROGRAMFILE=;
43 %LET _SASPROGRAMFILEHOST=;
44
45 ;*';*";*/;quit;run;
46 ODS _ALL_ CLOSE;
47
48
49 QUIT; RUN;
50

PaigeMiller · Posted 12-20-2021 12:29 PM

This type of problem can be easily diagnosed by looking at your SAS data sets with your own eyes. Here is data set part_2.

You have not read the different variables into part_2 properly, and so you get unexpected results. If you fix this, you will get the expected results.

--
Paige Miller

rookie21 · Posted 12-20-2021 12:39 PM

Ah, so I fixed this by deleting the number after $ for status which species the length of characters.

The new being:

data part_1;
input id_2 i_id i_date i_prev i_time id_3 id_i g_type$ g_procedure g_date;
informat i_date g_date ddmmyy10. ;
format i_date g_date ddmmyy10.;
datalines;
12345 62233 15.09.2003 99 30 12344 62233 PM 4 15-09-2003
12345 78341 01-09-2012 99 30 78341 62233 PM 4 01-09-2012
12345 95732 15-09-2015 99 30 95732 62233 PM 4 15-09-2015
22222 57321 04-07-1998 0  20 22222 57321 PM 8 04-07-1998
22222 97325 08-03-2001 0  18 22222 97325 PM 1 08-03-2001
; 
run; 

data part_2; 
input id_4 i_id2 id_l la ma_2$ mn_2$ status i_id; 
datalines;
12345 62233 5481 2 ME sprint exp 62233
12345 62233 5482 2 ME sprint exp 62233
12345 78341 8921 2 ME optis exp 78341
12345 95732 4841 2 ME optis active 95732
22222 57321 7777 3 GA Tip exp 57321
22222 97325 3901 3 GA Tip active 97325
; 
run;

Then tried merging again around i_id, expecting now only 2 lines as output . One for id 12345 and one for 22222.
It sadly gives me 7 lines

22222 57321 04/07/1998 0 20 22222 57321 PM 8 04/07/1998 22222 57321 7777 3 GA Tip .
12345 62233 15/09/2003 99 30 12344 62233 PM 4 15/09/2003 12345 62233 5481 2 ME sprint .
12345 62233 15/09/2003 99 30 12344 62233 PM 4 15/09/2003 12345 62233 5482 2 ME sprint .
12345 78341 01/09/2012 99 30 78341 62233 PM 4 01/09/2012 12345 78341 8921 2 ME optis .
12345 95732 15/09/2015 99 30 95732 62233 PM 4 15/09/2015 12345 95732 4841 2 ME optis .
22222 97325 08/03/2001 0 18 22222 97325 PM 1 08/03/2001 22222 97325 3901 3 GA Tip .

Is it because I shall make a step with new columns and rename them as goes for repeating events - like 1, 2, 3 and so on ?

PaigeMiller · Posted 12-20-2021 12:50 PM

Here is the fix for part_2

input id_4 i_id2 id_l la ma_2 $ mn_2 $ status :$6. i_id;

Note the colon informat modifier. From the SAS Documentation:

For a character variable, this format modifier reads the value from the next non-blank column until the pointer reaches the next blank column, the defined length of the variable, or the end of the data line, whichever comes first.

--
Paige Miller

rookie21 · Posted 12-20-2021 04:24 PM

Thanks the colon informat modifier helped to not continuing reading .

Tom · Posted 12-20-2021 12:41 PM

What are the key variables you want to merge on? Is it just I_ID?

What do you want to happen for I_ID = 62,233 that appears twice in PART_2?

A long as only one dataset is contributing more than observation to any BY group you can merge using data step.

proc sort data=part_1; by i_id; run;
proc sort data=part_2; by i_id; run;

data want;
  merge part_1 part_2;
  by i_id;
run;

proc print data=want;
  by i_id;
  id i_id;
run;

Results:

 i_id   id_2      i_date  i_prev  i_time   id_3   id_i  g_type  g_procedure      g_date   id_4  i_id2  id_l  la  ma_2  mn_2    status

57321  22222  04/07/1998     0      20    22222  57321    PM         8       04/07/1998  22222  57321  7777   3   GA   Tip     exp


62233  12345  15/09/2003    99      30    12344  62233    PM         4       15/09/2003  12345  62233  5481   2   ME   sprint  exp
       12345  15/09/2003    99      30    12344  62233    PM         4       15/09/2003  12345  62233  5482   2   ME   sprint  exp


78341  12345  01/09/2012    99      30    78341  62233    PM         4       01/09/2012  12345  78341  8921   2   ME   optis   exp


95732  12345  15/09/2015    99      30    95732  62233    PM         4       15/09/2015  12345  95732  4841   2   ME   optis   active


97325  22222  08/03/2001     0      18    22222  97325    PM         1       08/03/2001  22222  97325  3901   3   GA   Tip     active

Note: Your posted data step code for PART_2 is not going to work right becuase you included an informat in the INPUT statement without use the : modifier to make sure the line is parsed using LIST MODE.

data part_2; 
  input id_4 i_id2 id_l la ma_2 $ mn_2 $ status :$6. i_id; 
datalines;
12345 62233 5481 2 ME sprint exp 62233
12345 62233 5482 2 ME sprint exp 62233
12345 78341 8921 2 ME optis exp 78341
12345 95732 4841 2 ME optis active 95732
22222 57321 7777 3 GA Tip exp 57321
22222 97325 3901 3 GA Tip active 97325
;

rookie21 · Posted 12-20-2021 04:44 PM

Would like to merge on I_ID.

Whats called id_2 in data part_1 is equal to id_4 in part_2. However I dont need double of these. Thought I could delete on of the columns after successful merging.

The essential for me is to have all data corresponding to each id_2/4 on ONE row. Thats what I meant about the wide dataset.

For 12345 (id_2) I have 3 observations in data part_1 and in data part_2 I have 4 observations for this particular id.

I planned to do this in two steps:

1) merge everything by i_id

2) change the long to wide dataset (i_id 62233) has one row in data part_1 and 2 rows in data part_2. I would like to have them on same row, the second observation eventuel with the suffix: _2 to differ between nr 1 and 2 observaion for that particular i_id.

Now, Im not sure how I get

62233  12345  15/09/2003    99      30    12344  62233    PM         4       15/09/2003  12345  62233  5481   2   ME   sprint  exp
       12345  15/09/2003    99      30    12344  62233    PM         4       15/09/2003  12345  62233  5482   2   ME   sprint  exp

on one line same line.

At last I should end up with two rows corresponding only to two record_ids with all their information on same line like this:

12345 62233 15.09.2003 99 30 12344 62233 PM 4 15-09-2003 12345 78341 01-09-2012 99 30 78341 62233 PM 4 01-09-2012 12345 95732 15-09-2015 99 30 95732 62233 PM 4 15-09-2015 12345 62233 5481 2 ME sprint exp 62233 12345 62233 5482 2 ME sprint exp 62233 12345 78341 8921 2 ME optis exp 78341 12345 95732 4841 2 ME optis active 95732
22222 57321 04-07-1998 0  20 22222 57321 PM 8 04-07-1998 22222 97325 08-03-2001 0  18 22222 97325 PM 1 08-03-2001 22222 57321 7777 3 GA Tip exp 57321 22222 97325 3901 3 GA Tip active 97325

@PaigeMiller were also nice to mention the semicolon in my input statement. I changed it and that worked

This merging step now worked:
data wide;
merge part_1 part_2;
by i_id;
run;

Gave me this output:

Which data- og proc statement can I use to get the output collected?

Hope you guys can help my

Tom · Posted 12-20-2021 01:04 PM

Your subject line mentions LONG to WIDE format, but your question seems to be about either MERGE statement or possibly proper use of informats with the iNPUT statement.

What is your question about LONG to WIDE?

Can you describe the input and output structures?

Are you trying to convert multiple observations into a single one? Why ? What do you plan to do with it that requires a single observation?

rookie21 · Posted 12-20-2021 04:53 PM

Yea, I guess I hoped to solve my question by divided in

1, merging step

2, making wide step.

My question about long to wide is - as I have multiple events (actually up to 20 for each i_id), how can I convert it to wide ?

I was thinking proc transpose, but using it the simple way:

proc transpose data=want out=answer; 
run;

I get:

_NAME_	COL1	COL2	COL3	COL4	COL5	COL6
id_2	22222	12345	12345	12345	12345	22222
i_id	57321	62233	62233	78341	95732	97325
i_date	14064	15963	15963	19237	20346	15042
i_prev	0	99	99	99	99	0
i_time	20	30	30	30	30	18
id_3	22222	12344	12344	78341	95732	22222
id_i	57321	62233	62233	62233	62233	97325
g_procedure	8	4	4	4	4	1
g_date	14064	15963	15963	19237	20346	15042
id_4	22222	12345	12345	12345	12345	22222
i_id2	57321	62233	62233	78341	95732	97325
id_l	7777	5481	5482	8921	4841	3901
la	3	2	2	2	2	3

which is not exactly what I want... Mayby I need some by statement?
As I have different variable names, seems I cant just use "var" statement 😕

Tom · Posted 12-20-2021 05:55 PM

You need to explain more about what your data represents to get a better answer.

In general the reason you have multiple observations is you have repeated measures for some variable. For example you might measure heart rate multiple times.

data have;
  input id hr ;
cards;
1 75
1 70
2 56
2 50
3 80
3 79
;

In general leaving the data in this form is easier for analysis.

If you want to transpose the data from long to wide you can use PROC TRANSPOSE.

proc transpose data=have out=want(drop=_name_) prefix=hr ;
  by id;
  var hr;
run;

Which yields a dataset like this:

data want;
  input id hr1 hr2;
cards;
1     75     70
2     56     50
3     80     79
;

But that works best with simple examples like the one above where there is only one variable that is measured multiple times.

rookie21 · Posted 12-21-2021 01:40 AM

yes, I see how the transposing is working with only one repeated variable. But I have a lot which makes it difficult for me to get right

In my case, I have for each patient :

- multiple operations

- mutilpe types of operations.

So I want outcome to be in this format:

Patient_1 Operation_1 Operation_type_info_1 Operation_2 Operation_type_info_2 Operation_3 Operation_type_info_3_part1 Operation_type_info_3_part2

Patient_2 Operation_1 Operation_type_info_1 Operation_2 Operation_type_info_2

So data from id_2 -- g_date represents operations (dataset part_1).

And data from id_4 -- i_id (dataset part_2)

The id_2 and id_4 is the same number as this represent each patient in different dataset (part_1 and part_2). i_id represents operation code beeing the only operator that connect data from operation and type of operation.

Sorted by the i_date I need eldeste operation to be first in the line and newest operation to be last in the line (already proc sorted i_date in my early step)

I showed the expected output earlier with only 2 lines including all information for each patient gathered in one line.

The reason I should get it on one line is to do analysis like time between operation, characterics about percentage of operation type for first, second,... operation. Hope this give a better understanding of my data ?

Kurt_Bremser · Posted 12-21-2021 04:31 AM

Such statistics can be better derived from the long layout. Wide layout is needed only for regression analysis (where you would spread out the parameters like frequency of operation type or time between operation that you derive from the long dataset).

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!