Re: Making a long merged data set broad

Jannie_D · Posted 05-04-2021 04:59 AM

I am working on a secure server (which means I can not copy paste anything from it) with several large datasets that I have merged into one. for the current study, I need to use an ID number called pnr, a data and price from four different databases and a population number from 1-6. The aim is to create a total price for each ID and do statistical analyses on that. BUT I have encountered a problem. I have merged my datasets by the ID number and now each Id number has more that 20 lines each. (in a dataset with more than 18 000 ID numbers it makes it impossible to overview). I have tried many different ways of making it broad instead of long so that I can make a total variable but I have not succeeded. Hereunder I will try to go through the steps i have tried and done:

data all;

merge LMDB DRG SSSY DTSDCCG UDDA;

by pnr;

run;

that creates a very long dataset with several million observations.

Then I tried to do a proc sort step like this

PROC SORT DATA=all
DUPOUT=allnodup
NODUPRECS ;
BY pnr ;
RUN ;

proc print allnodup;

run;

which ended up running for more than a day using a lot of capacity unsuccessfully.

then I tried this

proc sql ;
create table allnodup as
select DISTINCT (pnr),
price1,
date1,
price2,
date2,
price 3,

date3
from all
order by pnr ;
quit ;

which did eliminate some

so then I did continue with with a

proc transpose data=allnodup out=allw prefix=price1;

by pnr;

id date1;

run;

proc print data=allw;

run;

this gave we a lot errors in the log saying the "Id value...... occures twice" and finaly "to many bad ID values"

I even tried to make to categories manually in a data step, defining payments/month but it seems as if there are seceral that have more price values every month.

Hereunder I tried to create an example of how it looks:

data all;
infile datalines dsd truncover;
input PNR:$13. Price1:8. date1:mmddyys10. price2:8. date2:mmddyys10. price3:8. date3:mmddyys10. population:8.
;
datalines4;
887766,1500, 01012014,45000,01012014,55000,01052018,1
887766, 1703,03042015, 45000, 01012014, 55000,01052018,1
887766, 1703,03042015, 45000, 02012014, 55000,01052018,1
887766, 1703,03042015, 45000, 01012014, 55000,01052018,1
887766, 1703,03042015, 45000, 01012014, 5701,01052018,1
887755,3200, 04022018,245761,03042017,7001,02032016,2
887755,8700, 04022018,245761,03042017,7001,02032016,2
887755,0, 04022018,245761,03042017,7001,02032016,2
887755,3300, 04022018,245761,03042017,7001,02032016,2
887755,3200, 04022018,241,03052017,7001,02032016,2
887755,3200, 04022018,245761,03042017,0,02032017,2
887744,9763,01012014,99838,01032018,5056,10252018,5
887744,9700,01012015,99838,01032018,5056,10252018,5
887744,9763,01012014,99858,01032018,5056,10252018,5
887744,9763,01012014,55,01072016,5056,10252018,5
887733,3387,07162015,45000,02032016,30050,12012014,3
887733,3377,07162015,45000,02032016,30050,12012014,3
887733,3377,07162015,789,02032017,30050,12012014,3
887733,3377,07162015,45000,02032016,30050,12012014,3
887733,3377,07162015,1700,12052016,30050,12012014,3
887733,3377,07162015,45000,02032016,308,12011014,3
887733,1500,07162015,45000,02032016,30050,12012014,3
887733,1500,07192015,45000,02032016,30050,12012014,3
887722,4352,07152017,0,03042018,47531,11222014,6
887722,4372,07162017,0,03042018,47531,11222014,6
887722,5678,07152017,0,03042018,47531,11222014,6
887722,4352,07152017,750,03052018,47531,11222014,6
887722,4352,07152017,0,03042018,4700,11252014,6
;;;;
title "All";
proc print;
run;

Obs	PNR	Price1	date1	price2	date2	price3	date3	population
1	887766	1500	1012014	45000	1012014	55000	1052018	1
2	887766	1703	3042015	45000	1012014	55000	1052018	1
3	887766	1703	3042015	45000	2012014	55000	1052018	1
4	887766	1703	3042015	45000	1012014	55000	1052018	1
5	887766	1703	3042015	45000	1012014	5701	1052018	1
6	887755	3200	4022018	245761	3042017	7001	2032016	2
7	887755	8700	4022018	245761	3042017	7001	2032016	2
8	887755	0	4022018	245761	3042017	7001	2032016	2
9	887755	3300	4022018	245761	3042017	7001	2032016	2
10	887755	3200	4022018	241	3052017	7001	2032016	2
11	887755	3200	4022018	245761	3042017	0	2032017	2
12	887744	9763	1012014	99838	1032018	5056	10252018	5
13	887744	9700	1012015	99838	1032018	5056	10252018	5
14	887744	9763	1012014	99858	1032018	5056	10252018	5
15	887744	9763	1012014	55	1072016	5056	10252018	5
16	887733	3387	7162015	45000	2032016	30050	12012014	3
17	887733	3377	7162015	45000	2032016	30050	12012014	3
18	887733	3377	7162015	789	2032017	30050	12012014	3
19	887733	3377	7162015	45000	2032016	30050	12012014	3
20	887733	3377	7162015	1700	12052016	30050	12012014	3
21	887733	3377	7162015	45000	2032016	308	12011014	3
22	887733	1500	7162015	45000	2032016	30050	12012014	3
23	887733	1500	7192015	45000	2032016	30050	12012014	3
24	887722	4352	7152017	0	3042018	47531	11222014	6
25	887722	4372	7162017	0	3042018	47531	11222014	6
26	887722	5678	7152017	0	3042018	47531	11222014	6
27	887722	4352	7152017	750	3052018	47531	11222014	6
28	887722	4352	7152017	0	3042018	4700	11252014	6

pnr: represents a social security number for a person but made with a secret key

costs are all in DKK

dates are all in date format except the one I did not enter hereover that is month and year format, the population is a group defined elsewhere there are 6 different groups.

So my aim is to create a total cost per person first per year and then overall and then compare between the populations.

So I am now stranded in a situation where I am not sure what I can do?

Maybe someone here can help me.

Kurt_Bremser · Posted 05-04-2021 05:11 AM

Please make up an example of your dataset ALL that illustrates your issue, and post it as a data step with datalines.

Right now it is hard to help you, because we have no idea what your data looks like.

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Jannie_D · Posted 05-04-2021 05:38 AM

I am not completly sure I understand what you mean?

PaigeMiller · Posted 05-04-2021 06:29 AM

Make up a data set that accurately represents your problem. Show it to us as SAS data step code, or via these instructions: https://blogs.sas.com/content/sastraining/2016/03/11/jedi-sas-tricks-data-to-data-step-macro/.

--
Paige Miller

Jannie_D · Posted 05-04-2021 09:38 AM

I thought that you wanted a look at how the data looked. Our data commes from big national databases and we did not make our dataset ourselves. our task is to clean and analyse it (and as a phd student with 6 months SAS experience it can be difficult when the amounts of data are this huge.
My suspicion is that there are more than one payment in one or more of the price categories and the same date. But I dont know how to get around that.

Herunder is the code I used to build the tanel above:
data all;
infile datalines dsd truncover;
input PNR:$13. Price1:8. date1:mmddyys10. price2:8. date2:mmddyys10. price3:8. date3:mmddyys10. population:8.
;
datalines4;
887766,1500, 01012014,45000,01012014,55000,01052018,1
887755,3200, 04022018,245761,03042017,7001,02032016,2
887744,9763,01012014,99838,01032018,5056,10252018,5
887733,3377,07162015,45000,02031016,30050,12012014,3
887722,4352,07152017,0,03042018,47531,11222014,6
;;;;
title "All";
proc print;
run;

Tom · Posted 05-04-2021 10:15 AM

Thanks. This is closer to a description of your data. You should supply more information about what the variables mean. You should supply more information about what you are actually trying to do. Your original request made it sound like you just want to transpose from "tall" to "wide", but this data looks like it is already wide. You mention cleaning. What is it that you want to check? Are you having issues with invalid values (strings that cannot be read as dates for example?) Or some other more complex cross record checks, like end date before start date?

PS When using this forum to post code use the buttons above the edit/text box to open a pop-up window to paste/type the code. Use the same button to re-open the pop-up window to make edits to the code. The SAS code button looks like this:

Jannie_D · Posted 05-04-2021 09:19 AM

So I tried to make an example

Obs	PNR	Price1	date1	price2	date2	price3	date3	population
1	887766	1500	1012014	45000	1012014	55000	1052018	1
2	887755	3200	4022018	245761	3042017	7001	2032016	2
3	887744	9763	1012014	99838	1032018	5056	10252018	5
4	887733	3377	7162015	45000	2031016	30050	12012014	3
5	887722	4352	7152017	0	3042018	47531	11222014	6

except theres is another columne of price and date and that date is given in month and year. and then yes there are another 18000 rows in the original. does it help you helping me?

PaigeMiller · Posted 05-04-2021 09:26 AM

Repeating the earlier request: "Show it to us as SAS data step code, or via these instructions: https://blogs.sas.com/content/sastraining/2016/03/11/jedi-sas-tricks-data-to-data-step-macro/."

Are you showing us the long data set, or the broad data set?

--
Paige Miller

Tom · Posted 05-04-2021 09:59 AM

That looks like the output of PROC PRINT, perhaps copied from ODS output instead of normal listing output?

To share the data it would be better to dump it to the log using DSD option.

data _null_;
  set have (obs=5);
  file log dsd ;
  put (_all_) (+0);
run;

Then you could copy and paste those comma separated values into in-line data (CARDS aka DATALINES) and write an input statement to read them back into SAS variables. Then people trying to help can just copy the code into SAS and create a copy of your sample data.

data have;
  infile cards dsd truncover ;
  input PNR Price1 date1 price2 date2 price3 date3 population;
cards;
887766,1500,1012014,45000,1012014,55000,1052018,1
887755,3200,4022018,245761,3042017,7001,2032016,2
887744,9763,1012014,99838,1032018,5056,10252018,5
887733,3377,7162015,45000,2031016,30050,12012014,3
887722,4352,7152017,0,3042018,47531,11222014,6
;

Do you really store the dates as numbers like the 1,012,014 value of DATE2 that you have in that report?

If so why are you not storing them as dates?

Jannie_D · Posted 05-05-2021 05:13 AM

I have revised my original post. I hope it all makes better sense now

Kurt_Bremser · Posted 05-05-2021 05:28 AM

Correct your code like this:

data all;
infile datalines dsd truncover;
input PNR:$13. Price1:8. date1:mmddyy10. price2:8. date2:mmddyy10. price3:8. date3:mmddyy10. population:8.
;
format date: yymmdd10.;
datalines4;

so that the dates are read as SAS dates.

Now, what output do you expect for ID 887766, just to have one example?

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Jannie_D · Posted 05-05-2021 05:45 AM

Thank you for the help with formatting the date.
each pnr-number should end up having a total cost per year and a total allover cost. As I see it I need to make the dataset broad to do that. but somehow I am not succeeding.
så that I for each person can create ie. cost12014, cost22014 etc and totalcost2014 and then for the other years and finally for all years and all cost and then compare between populations. does that make sense?

to sum the costs I do need them on the same line right?

Kurt_Bremser · Posted 05-05-2021 06:06 AM

Instead of making a dataset broad(er), you always strive for making it narrow, or long instead of wide (see Maxim 19).

This reduces the number of columns and therefore the coding effort needed.

So you run this data steps:

data all;
infile datalines dsd truncover;
input PNR:$13. Price1:8. date1:mmddyy10. price2:8. date2:mmddyy10. price3:8. date3:mmddyy10. population:8.
;
format date: yymmdd10.;
datalines4;
887766,1500, 01012014,45000,01012014,55000,01052018,1
887766, 1703,03042015, 45000, 01012014, 55000,01052018,1
887766, 1703,03042015, 45000, 02012014, 55000,01052018,1
887766, 1703,03042015, 45000, 01012014, 55000,01052018,1
887766, 1703,03042015, 45000, 01012014, 5701,01052018,1
887755,3200, 04022018,245761,03042017,7001,02032016,2
887755,8700, 04022018,245761,03042017,7001,02032016,2
887755,0, 04022018,245761,03042017,7001,02032016,2
887755,3300, 04022018,245761,03042017,7001,02032016,2
887755,3200, 04022018,241,03052017,7001,02032016,2
887755,3200, 04022018,245761,03042017,0,02032017,2
887744,9763,01012014,99838,01032018,5056,10252018,5
887744,9700,01012015,99838,01032018,5056,10252018,5
887744,9763,01012014,99858,01032018,5056,10252018,5
887744,9763,01012014,55,01072016,5056,10252018,5
887733,3387,07162015,45000,02032016,30050,12012014,3
887733,3377,07162015,45000,02032016,30050,12012014,3
887733,3377,07162015,789,02032017,30050,12012014,3
887733,3377,07162015,45000,02032016,30050,12012014,3
887733,3377,07162015,1700,12052016,30050,12012014,3
887733,3377,07162015,45000,02032016,308,12011014,3
887733,1500,07162015,45000,02032016,30050,12012014,3
887733,1500,07192015,45000,02032016,30050,12012014,3
887722,4352,07152017,0,03042018,47531,11222014,6
887722,4372,07162017,0,03042018,47531,11222014,6
887722,5678,07152017,0,03042018,47531,11222014,6
887722,4352,07152017,750,03052018,47531,11222014,6
887722,4352,07152017,0,03042018,4700,11252014,6
;;;;

data want;
set all;
by pnr notsorted;
array years {1900:2050} _temporary_;
array dates {*} date:;
array prices {*} price:;
keep pnr year total_cost;
if first.pnr then call missing(of years{*});
do i = 1 to dim(dates);
  if dates{i} ne . then years{year(dates{i})} + prices{i};
end;
if last.pnr then do year = 1900 to 2050;
  if years{year} ne .
  then do;
    total_cost = years{year};
    output;
  end;
end;
run;

which, without any further sorting, gives you just one observation per PNR and year.

From which you can easily run a report like this:

proc report data=want;
column pnr year total_cost;
define pnr / group;
define year / display;
define total_cost / analysis sum;
break after pnr / summarize;
run;

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Jannie_D · Posted 05-05-2021 08:30 AM

I am very sorry but I think I need a bit further explanaition to understand what I am doing and to fit it with my real dataset.

so the first 3 lines makes sense. except that my file is sorted with a noduprec statement.

line 4: years (does that define the variable or array or it my variables covering years? and why temporary?

well a bit the same with next two lines: is it dates/prices or date/price (after the asterix) that is avriable?

I am skiping the keep statement as there are a few other variables that I need for subanalysis, like gender, bmi etc.

the next line I do not understand. what does it do? actualy I am not sure what any of the rest does.

I can not make it work in my original dataset, probably because I do not understand the code

data want;
set all;
by pnr notsorted;
array years {1900:2050} _temporary_;
array dates {*} date:;
array prices {*} price:;
keep pnr year total_cost;
if first.pnr then call missing(of years{*});
do i = 1 to dim(dates);
  if dates{i} ne . then years{year(dates{i})} + prices{i};
end;
if last.pnr then do year = 1900 to 2050;
  if years{year} ne .
  then do;
    total_cost = years{year};
    output;
  end;
end;
run;

which, without any further sorting, gives you just one observation per PNR and year.

From which you can easily run a report like this:

proc report data=want;
column pnr year total_cost;
define pnr / group;
define year / display;
define total_cost / analysis sum;
break after pnr / summarize;
run;

and why temporary?

Kurt_Bremser · Posted 05-05-2021 08:50 AM

With your data as posted, I could also have used the DESCENDING option in the BY statement; I just wanted to be sure it works.

Line 4 creates an array as a pure memory structure (because of _TEMPORARY_) that covers all possible years with a numeric item for each. The array covers all years that are somehow probable (you can expand or shrink it as makes sense in your context).

_TEMPORARY_ has several effects:

all the items have the same size, so addressing a single item is _very_ fast
the individual items have no names (as in other arrays that consist of variables in the PDV)
nothing of it can appear in the output (b/c no names)
the array is automatically retained

The next two lines define arrays for the date and price variables in the incoming dataset. It is assumed that you will have an equal number of those.

The KEEP statement solves two purposes: it keeps only the variables needed (add yours here), and it helps define the order of the variable in the output (note that this is the first place in the code where year and total_cost appear).

At the beginning of a group (first.pnr) I clear the array, and then I scan all the dates/prices in every observation and add them up in their respective years. See the documentation for the use of OF in data step functions here.

Next, in every observation, I scan through the dates and add the matching price to the respective year sum (year derived from the date), if the date is not missing.

When a group finishes (last.pnr), I iterate through the array and write an observation for every year that has a non-missing value.

As a result, you get a very neat dataset, keeping only the necessary variables and observations (no space wasted for missing values etc).

Maxims of Maximally Efficient SAS Programmers
How to convert datasets to data steps
The macro for direct download as ZIP
How to post code
Please vote for Provide Sequential Search Capability for Hash Objects
How to deal with locked files on UNIX

Ready to join fellow brilliant minds for the SAS Hackathon?

Classroom Training Available!