I have a dataset of 5000 observations and about 200 variables, keyed by user id. A little more than 1000 users have multiple entries. These aren't duplicate entries: they have different values for some of those 200 variables, and it's not the same variables for every user that has multiple entries. See the example data set below.
data fake_data;
input user_id 3. Var1 2. Var2 3. Var3 2. Var4 3. Var5 2. Var6 $;
datalines;
03 0 10 0 31 0 .
03 0 11 1 31 1 .
04 0 00 1 10 0 .
04 0 47 0 21 1 .
04 0 47 0 21 0 .
04 0 74 0 21 0 .
05 0 00 0 22 1 .
05 0 15 0 22 0 .
06 0 00 9 26 0 .
06 0 00 9 26 0 .
06 0 00 9 26 0 .
07 0 00 0 72 0 .
07 5 00 0 84 1 .
07 0 00 9 84 0 .
08 0 00 0 32 1 .
08 0 00 9 32 0 .
09 0 54 0 54 0 Excel
09 0 54 0 54 0 Excel
10 0 73 0 18 0 .
10 0 00 9 18 0 .
10 0 23 0 18 1 .
10 0 93 0 18 0 .
11 0 37 0 12 1 .
11 0 43 0 12 0 .
11 0 43 0 12 1 .
12 0 40 0 15 0 .
12 0 00 0 15 0 .
13 1 00 0 16 0 .
13 0 63 0 16 1 .
;
I need to turn this into a wide data set with one row per user id, without creating columns that duplicate data. That is, when a user has different values for a variable, spread them into extra columns; when the values are the same, don't repeat the column. How can I do this when the pattern is so irregular? I can work in SAS, Excel, or R.
Why do you think you need to dedup into columns? Deduping into rows is a lot easier and will make analysis a lot easier. A dedup by row could look like this:
userid VarID Value
03 var1 0
03 var2 10
03 var3 0
03 var4 31
03 var5 0
03 var6 .
03 var2 11
03 var3 1
03 var5 1
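The row-wise dedup suggested above (reshape wide to long, then drop identical rows) can be sketched outside SAS as well. Below is a minimal pandas illustration using a toy two-row subset of the example data for user 03; the column names mirror the posted data, but the frame itself is made up for demonstration.

```python
import pandas as pd

# Toy subset of the example data (user_id 03 only); values taken from the post.
df = pd.DataFrame({
    "user_id": [3, 3],
    "Var1": [0, 0],
    "Var2": [10, 11],
    "Var3": [0, 1],
})

# Wide -> long: one row per (user_id, VarID, Value) triple.
long = df.melt(id_vars="user_id", var_name="VarID", value_name="Value")

# Dedup by row: identical triples collapse to a single row, so repeated
# values (Var1 = 0 twice) appear once, while differing values (Var2 = 10
# and 11) are both kept as separate rows.
deduped = long.drop_duplicates().sort_values(["user_id", "VarID", "Value"])
print(deduped)
```

The same reshape is `proc transpose` followed by `proc sort nodupkey` in SAS, or `pivot_longer()` plus `distinct()` in R's tidyverse.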
I have no idea what you mean by "make column wide".
You can make a character variable longer (to a limit of 32767 bytes) so it could store multiple values. But you cannot store multiple numbers into one single value.
/*
I think you'd better post the output you want to see,
to explain your question better.
The simplest way to do this is with PROC SUMMARY.
*/
data fake_data;
input user_id 3. Var1 2. Var2 3. Var3 2. Var4 3. Var5 2. Var6 $;
datalines;
03 0 10 0 31 0 .
03 0 11 1 31 1 .
04 0 00 1 10 0 .
04 0 47 0 21 1 .
04 0 47 0 21 0 .
04 0 74 0 21 0 .
05 0 00 0 22 1 .
05 0 15 0 22 0 .
06 0 00 9 26 0 .
06 0 00 9 26 0 .
06 0 00 9 26 0 .
07 0 00 0 72 0 .
07 5 00 0 84 1 .
07 0 00 9 84 0 .
08 0 00 0 32 1 .
08 0 00 9 32 0 .
09 0 54 0 54 0 Excel
09 0 54 0 54 0 Excel
10 0 73 0 18 0 .
10 0 00 9 18 0 .
10 0 23 0 18 1 .
10 0 93 0 18 0 .
11 0 37 0 12 1 .
11 0 43 0 12 0 .
11 0 43 0 12 1 .
12 0 40 0 15 0 .
12 0 00 0 15 0 .
13 1 00 0 16 0 .
13 0 63 0 16 1 .
;
proc sql noprint;
  select max(n) into :n
  from (select count(*) as n from fake_data group by user_id);
quit;

proc summary data=fake_data;
  by user_id;
  output out=want(drop=_freq_ _type_) idgroup(out[&n.] (Var1-Var6)=);
run;
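For readers more comfortable in R or Python, the IDGROUP logic above (number each record within a user_id, then spread record k's values into `Var*_k` columns) can be sketched in pandas. The frame below is a hypothetical three-row subset of the example data, not the full dataset.

```python
import pandas as pd

# Hypothetical subset: user 03 has two records, user 04 has one.
df = pd.DataFrame({
    "user_id": [3, 3, 4],
    "Var2": [10, 11, 0],
    "Var3": [0, 1, 1],
})

# Number each record within a user_id (1, 2, ...), mirroring what
# IDGROUP's OUT[&n] does with the max group size from PROC SQL.
df["seq"] = df.groupby("user_id").cumcount() + 1

# Pivot so record k's values land in columns like Var2_1, Var2_2, ...
wide = df.pivot(index="user_id", columns="seq", values=["Var2", "Var3"])
wide.columns = [f"{var}_{k}" for var, k in wide.columns]
wide = wide.reset_index()
print(wide)
```

Users with fewer records than the maximum get missing values in the trailing columns, just as PROC SUMMARY leaves them missing in the `want` dataset.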