I am working on a secure server (which means I can not copy paste anything from it) with several large datasets that I have merged into one. for the current study, I need to use an ID number called pnr, a data and price from four different databases and a population number from 1-6. The aim is to create a total price for each ID and do statistical analyses on that. BUT I have encountered a problem. I have merged my datasets by the ID number and now each Id number has more that 20 lines each. (in a dataset with more than 18 000 ID numbers it makes it impossible to overview). I have tried many different ways of making it broad instead of long so that I can make a total variable but I have not succeeded. Hereunder I will try to go through the steps i have tried and done:
data all;
merge LMDB DRG SSSY DTSDCCG UDDA;
by pnr;
run;
that creates a very long dataset with several million observations.
Then I tried to do a proc sort step like this
PROC SORT DATA=all
DUPOUT=allnodup
NODUPRECS ;
BY pnr ;
RUN ;
proc print allnodup;
run;
which ended up running for more than a day using a lot of capacity unsuccessfully.
then I tried this
proc sql ;
create table allnodup as
select DISTINCT (pnr),
price1,
date1,
price2,
date2,
price 3,
date3
from all
order by pnr ;
quit ;
which did eliminate some
so then I did continue with with a
proc transpose data=allnodup out=allw prefix=price1;
by pnr;
id date1;
run;
proc print data=allw;
run;
this gave we a lot errors in the log saying the "Id value...... occures twice" and finaly "to many bad ID values"
I even tried to make to categories manually in a data step, defining payments/month but it seems as if there are seceral that have more price values every month.
Hereunder I tried to create an example of how it looks:
data all;
infile datalines dsd truncover;
input PNR:$13. Price1:8. date1:mmddyys10. price2:8. date2:mmddyys10. price3:8. date3:mmddyys10. population:8.
;
datalines4;
887766,1500, 01012014,45000,01012014,55000,01052018,1
887766, 1703,03042015, 45000, 01012014, 55000,01052018,1
887766, 1703,03042015, 45000, 02012014, 55000,01052018,1
887766, 1703,03042015, 45000, 01012014, 55000,01052018,1
887766, 1703,03042015, 45000, 01012014, 5701,01052018,1
887755,3200, 04022018,245761,03042017,7001,02032016,2
887755,8700, 04022018,245761,03042017,7001,02032016,2
887755,0, 04022018,245761,03042017,7001,02032016,2
887755,3300, 04022018,245761,03042017,7001,02032016,2
887755,3200, 04022018,241,03052017,7001,02032016,2
887755,3200, 04022018,245761,03042017,0,02032017,2
887744,9763,01012014,99838,01032018,5056,10252018,5
887744,9700,01012015,99838,01032018,5056,10252018,5
887744,9763,01012014,99858,01032018,5056,10252018,5
887744,9763,01012014,55,01072016,5056,10252018,5
887733,3387,07162015,45000,02032016,30050,12012014,3
887733,3377,07162015,45000,02032016,30050,12012014,3
887733,3377,07162015,789,02032017,30050,12012014,3
887733,3377,07162015,45000,02032016,30050,12012014,3
887733,3377,07162015,1700,12052016,30050,12012014,3
887733,3377,07162015,45000,02032016,308,12011014,3
887733,1500,07162015,45000,02032016,30050,12012014,3
887733,1500,07192015,45000,02032016,30050,12012014,3
887722,4352,07152017,0,03042018,47531,11222014,6
887722,4372,07162017,0,03042018,47531,11222014,6
887722,5678,07152017,0,03042018,47531,11222014,6
887722,4352,07152017,750,03052018,47531,11222014,6
887722,4352,07152017,0,03042018,4700,11252014,6
;;;;
title "All";
proc print;
run;
Obs | PNR | Price1 | date1 | price2 | date2 | price3 | date3 | population |
1 | 887766 | 1500 | 1012014 | 45000 | 1012014 | 55000 | 1052018 | 1 |
2 | 887766 | 1703 | 3042015 | 45000 | 1012014 | 55000 | 1052018 | 1 |
3 | 887766 | 1703 | 3042015 | 45000 | 2012014 | 55000 | 1052018 | 1 |
4 | 887766 | 1703 | 3042015 | 45000 | 1012014 | 55000 | 1052018 | 1 |
5 | 887766 | 1703 | 3042015 | 45000 | 1012014 | 5701 | 1052018 | 1 |
6 | 887755 | 3200 | 4022018 | 245761 | 3042017 | 7001 | 2032016 | 2 |
7 | 887755 | 8700 | 4022018 | 245761 | 3042017 | 7001 | 2032016 | 2 |
8 | 887755 | 0 | 4022018 | 245761 | 3042017 | 7001 | 2032016 | 2 |
9 | 887755 | 3300 | 4022018 | 245761 | 3042017 | 7001 | 2032016 | 2 |
10 | 887755 | 3200 | 4022018 | 241 | 3052017 | 7001 | 2032016 | 2 |
11 | 887755 | 3200 | 4022018 | 245761 | 3042017 | 0 | 2032017 | 2 |
12 | 887744 | 9763 | 1012014 | 99838 | 1032018 | 5056 | 10252018 | 5 |
13 | 887744 | 9700 | 1012015 | 99838 | 1032018 | 5056 | 10252018 | 5 |
14 | 887744 | 9763 | 1012014 | 99858 | 1032018 | 5056 | 10252018 | 5 |
15 | 887744 | 9763 | 1012014 | 55 | 1072016 | 5056 | 10252018 | 5 |
16 | 887733 | 3387 | 7162015 | 45000 | 2032016 | 30050 | 12012014 | 3 |
17 | 887733 | 3377 | 7162015 | 45000 | 2032016 | 30050 | 12012014 | 3 |
18 | 887733 | 3377 | 7162015 | 789 | 2032017 | 30050 | 12012014 | 3 |
19 | 887733 | 3377 | 7162015 | 45000 | 2032016 | 30050 | 12012014 | 3 |
20 | 887733 | 3377 | 7162015 | 1700 | 12052016 | 30050 | 12012014 | 3 |
21 | 887733 | 3377 | 7162015 | 45000 | 2032016 | 308 | 12011014 | 3 |
22 | 887733 | 1500 | 7162015 | 45000 | 2032016 | 30050 | 12012014 | 3 |
23 | 887733 | 1500 | 7192015 | 45000 | 2032016 | 30050 | 12012014 | 3 |
24 | 887722 | 4352 | 7152017 | 0 | 3042018 | 47531 | 11222014 | 6 |
25 | 887722 | 4372 | 7162017 | 0 | 3042018 | 47531 | 11222014 | 6 |
26 | 887722 | 5678 | 7152017 | 0 | 3042018 | 47531 | 11222014 | 6 |
27 | 887722 | 4352 | 7152017 | 750 | 3052018 | 47531 | 11222014 | 6 |
28 | 887722 | 4352 | 7152017 | 0 | 3042018 | 4700 | 11252014 | 6 |
pnr: represents a social security number for a person but made with a secret key
costs are all in DKK
dates are all in date format except the one I did not enter hereover that is month and year format, the population is a group defined elsewhere there are 6 different groups.
So my aim is to create a total cost per person first per year and then overall and then compare between the populations.
So I am now stranded in a situation where I am not sure what I can do?
Maybe someone here can help me.
Please make up an example of your dataset ALL that illustrates your issue, and post it as a data step with datalines.
Right now it is hard to help you, because we have no idea what your data looks like.
Make up a data set that accurately represents your problem. Show it to us as SAS data step code, or via these instructions: https://blogs.sas.com/content/sastraining/2016/03/11/jedi-sas-tricks-data-to-data-step-macro/.
Thanks. This is closer to a description of your data. You should supply more information about what the variables mean. You should supply more information about what you are actually trying to do. Your original request made it sound like you just want to transpose from "tall" to "wide", but this data looks like it is already wide. You mention cleaning. What is it that you want to check? Are you having issues with invalid values (strings that cannot be read as dates for example?) Or some other more complex cross record checks, like end date before start date?
PS When using this forum to post code use the buttons above the edit/text box to open a pop-up window to paste/type the code. Use the same button to re-open the pop-up window to make edits to the code. The SAS code button looks like this:
So I tried to make an example
Obs | PNR | Price1 | date1 | price2 | date2 | price3 | date3 | population |
1 | 887766 | 1500 | 1012014 | 45000 | 1012014 | 55000 | 1052018 | 1 |
2 | 887755 | 3200 | 4022018 | 245761 | 3042017 | 7001 | 2032016 | 2 |
3 | 887744 | 9763 | 1012014 | 99838 | 1032018 | 5056 | 10252018 | 5 |
4 | 887733 | 3377 | 7162015 | 45000 | 2031016 | 30050 | 12012014 | 3 |
5 | 887722 | 4352 | 7152017 | 0 | 3042018 | 47531 | 11222014 | 6 |
except theres is another columne of price and date and that date is given in month and year. and then yes there are another 18000 rows in the original. does it help you helping me?
Repeating the earlier request: "Show it to us as SAS data step code, or via these instructions: https://blogs.sas.com/content/sastraining/2016/03/11/jedi-sas-tricks-data-to-data-step-macro/."
Are you showing us the long data set, or the broad data set?
That looks like the output of PROC PRINT, perhaps copied from ODS output instead of normal listing output?
To share the data it would be better to dump it to the log using DSD option.
data _null_;
set have (obs=5);
file log dsd ;
put (_all_) (+0);
run;
Then you could copy and paste those comma separated values into in-line data (CARDS aka DATALINES) and write an input statement to read them back into SAS variables. Then people trying to help can just copy the code into SAS and create a copy of your sample data.
data have;
infile cards dsd truncover ;
input PNR Price1 date1 price2 date2 price3 date3 population;
cards;
887766,1500,1012014,45000,1012014,55000,1052018,1
887755,3200,4022018,245761,3042017,7001,2032016,2
887744,9763,1012014,99838,1032018,5056,10252018,5
887733,3377,7162015,45000,2031016,30050,12012014,3
887722,4352,7152017,0,3042018,47531,11222014,6
;
Do you really store the dates as numbers like the 1,012,014 value of DATE2 that you have in that report?
If so why are you not storing them as dates?
I have revised my original post. I hope it all makes better sense now
Correct your code like this:
data all;
infile datalines dsd truncover;
input PNR:$13. Price1:8. date1:mmddyy10. price2:8. date2:mmddyy10. price3:8. date3:mmddyy10. population:8.
;
format date: yymmdd10.;
datalines4;
so that the dates are read as SAS dates.
Now, what output do you expect for ID 887766, just to have one example?
Instead of making a dataset broad(er), you always strive for making it narrow, or long instead of wide (see Maxim 19).
This reduces the number of columns and therefore the coding effort needed.
So you run this data steps:
data all;
infile datalines dsd truncover;
input PNR:$13. Price1:8. date1:mmddyy10. price2:8. date2:mmddyy10. price3:8. date3:mmddyy10. population:8.
;
format date: yymmdd10.;
datalines4;
887766,1500, 01012014,45000,01012014,55000,01052018,1
887766, 1703,03042015, 45000, 01012014, 55000,01052018,1
887766, 1703,03042015, 45000, 02012014, 55000,01052018,1
887766, 1703,03042015, 45000, 01012014, 55000,01052018,1
887766, 1703,03042015, 45000, 01012014, 5701,01052018,1
887755,3200, 04022018,245761,03042017,7001,02032016,2
887755,8700, 04022018,245761,03042017,7001,02032016,2
887755,0, 04022018,245761,03042017,7001,02032016,2
887755,3300, 04022018,245761,03042017,7001,02032016,2
887755,3200, 04022018,241,03052017,7001,02032016,2
887755,3200, 04022018,245761,03042017,0,02032017,2
887744,9763,01012014,99838,01032018,5056,10252018,5
887744,9700,01012015,99838,01032018,5056,10252018,5
887744,9763,01012014,99858,01032018,5056,10252018,5
887744,9763,01012014,55,01072016,5056,10252018,5
887733,3387,07162015,45000,02032016,30050,12012014,3
887733,3377,07162015,45000,02032016,30050,12012014,3
887733,3377,07162015,789,02032017,30050,12012014,3
887733,3377,07162015,45000,02032016,30050,12012014,3
887733,3377,07162015,1700,12052016,30050,12012014,3
887733,3377,07162015,45000,02032016,308,12011014,3
887733,1500,07162015,45000,02032016,30050,12012014,3
887733,1500,07192015,45000,02032016,30050,12012014,3
887722,4352,07152017,0,03042018,47531,11222014,6
887722,4372,07162017,0,03042018,47531,11222014,6
887722,5678,07152017,0,03042018,47531,11222014,6
887722,4352,07152017,750,03052018,47531,11222014,6
887722,4352,07152017,0,03042018,4700,11252014,6
;;;;
data want;
set all;
by pnr notsorted;
array years {1900:2050} _temporary_;
array dates {*} date:;
array prices {*} price:;
keep pnr year total_cost;
if first.pnr then call missing(of years{*});
do i = 1 to dim(dates);
if dates{i} ne . then years{year(dates{i})} + prices{i};
end;
if last.pnr then do year = 1900 to 2050;
if years{year} ne .
then do;
total_cost = years{year};
output;
end;
end;
run;
which, without any further sorting, gives you just one observation per PNR and year.
From which you can easily run a report like this:
proc report data=want;
column pnr year total_cost;
define pnr / group;
define year / display;
define total_cost / analysis sum;
break after pnr / summarize;
run;
I am very sorry but I think I need a bit further explanaition to understand what I am doing and to fit it with my real dataset.
so the first 3 lines makes sense. except that my file is sorted with a noduprec statement.
line 4: years (does that define the variable or array or it my variables covering years? and why temporary?
well a bit the same with next two lines: is it dates/prices or date/price (after the asterix) that is avriable?
I am skiping the keep statement as there are a few other variables that I need for subanalysis, like gender, bmi etc.
the next line I do not understand. what does it do? actualy I am not sure what any of the rest does.
I can not make it work in my original dataset, probably because I do not understand the code
data want;
set all;
by pnr notsorted;
array years {1900:2050} _temporary_;
array dates {*} date:;
array prices {*} price:;
keep pnr year total_cost;
if first.pnr then call missing(of years{*});
do i = 1 to dim(dates);
if dates{i} ne . then years{year(dates{i})} + prices{i};
end;
if last.pnr then do year = 1900 to 2050;
if years{year} ne .
then do;
total_cost = years{year};
output;
end;
end;
run;
which, without any further sorting, gives you just one observation per PNR and year.
From which you can easily run a report like this:
proc report data=want;
column pnr year total_cost;
define pnr / group;
define year / display;
define total_cost / analysis sum;
break after pnr / summarize;
run;
and why temporary?
With your data as posted, I could also have used the DESCENDING option in the BY statement; I just wanted to be sure it works.
Line 4 creates an array as a pure memory structure (because of _TEMPORARY_) that covers all possible years with a numeric item for each. The array covers all years that are somehow probable (you can expand or shrink it as makes sense in your context).
_TEMPORARY_ has several effects:
The next two lines define arrays for the date and price variables in the incoming dataset. It is assumed that you will have an equal number of those.
The KEEP statement solves two purposes: it keeps only the variables needed (add yours here), and it helps define the order of the variable in the output (note that this is the first place in the code where year and total_cost appear).
At the beginning of a group (first.pnr) I clear the array, and then I scan all the dates/prices in every observation and add them up in their respective years. See the documentation for the use of OF in data step functions here.
Next, in every observation, I scan through the dates and add the matching price to the respective year sum (year derived from the date), if the date is not missing.
When a group finishes (last.pnr), I iterate through the array and write an observation for every year that has a non-missing value.
As a result, you get a very neat dataset, keeping only the necessary variables and observations (no space wasted for missing values etc).
Registration is now open for SAS Innovate 2025 , our biggest and most exciting global event of the year! Join us in Orlando, FL, May 6-9.
Sign up by Dec. 31 to get the 2024 rate of just $495.
Register now!
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.