About SASJedi

SASJedi

What do you mean by "size 4"? Are you saying that the PostgreSQL table columns are VARCHAR(4) but SAS is converting them to CHAR(1024)?

SASJedi

To do it all at once:

data have;
   y='pn415pn418 pn415pn418 pn414 pn415pn417pn413p n415pn418 pn417pn417 pn415';
run;

data want;
   /* Lengthen y first, so the inserted | delimiters are not truncated */
   length y $200;
   set have;
   /* Remove all spaces. Put a | delimiter in front of each 'pn' */
   y=tranwrd(compress(y),'pn','|pn');
   /* countw will count the number of values in y */
   do word=1 to countw(y);
      /* Extract each value */
      Text=scan(y,word);
      /* Do whatever you want with the value. */
      put word= Text=;
   end;
run;

SASJedi

I just ran this and got the results I showed you. Can you please run the test program above in a fresh SAS session and then share the complete log?

SASJedi

data want (drop=second);
   set catletters;
   length ConcatText $50;
   retain ConcatText;
   by first;
   if first.first then call missing(concatText);
   concatText=catx(',',concatText,second);
   if last.first then output;
run;
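If you want to try the step above without the original table, here is a minimal sketch of an input data set. The values are made up, and it assumes that FIRST and SECOND are short character columns and that the rows are already sorted by FIRST (the BY statement requires that).

```sas
/* Hypothetical sample data; the real catletters table was not posted */
data catletters;
   input first $ second $;
   datalines;
A x
A y
B z
;
run;
```

With this input, the DATA step above should write one row per value of FIRST, with ConcatText holding x,y for A and z for B.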

SASJedi

Don't use PUT, just use the formatted numeric value. For example:

data have;
   format date mmddyy10. value dollar10.;
   do Date='01JAN2024'd to '10JUL2024'd by 7;
      value+month(date);
      output;
   end;
run;

proc means data=have sum maxdec=2;
   class date;
   format date monyy7.;
   var value;
run;

Result:

Analysis Variable : value

date     N Obs        Sum
JAN24        5      15.00
FEB24        4      40.00
MAR24        4      82.00
APR24        5     185.00
MAY24        4     230.00
JUN24        4     320.00
JUL24        2     199.00

SASJedi

This DATA step builds your input data:

data have;
   infile datalines dsd dlm='|';
   input ID:$6. Outcome Outcome_timepoint:$3.
         T1 T2 T3 T4 T5 T6 T7 T8 T9 T10
         T11 T12 T13 T14 T15 T16 T17 T18 T19 T20;
datalines;
XXXXX1|1|T5|0|1|1|0|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|3|0
XXXXX2|0|NA|0|1|1|0|1|1|1|0|0|0|0|1|1|0|1|1|1|0|0|0|5|5
XXXXX3|1|T13|0|1|1|0|1|1|1|0|0|0|0|1|1|0|0|0|0|0|0|0|5|2
XXXXX4|0|NA|0|1|1|0|1|1|1|0|0|0|0|1|1|0|1|1|1|0|0|0|5|5
;

This DATA step performs the actions you specified:

data want;
   drop i;
   set have;
   array t[*] t:;
   if outcome=0 then do;
      PT_pre=sum(of t1-t9);
      PT_Post=sum(of t10-t20);
   end;
   else if outcome=1 then do;
      /* Sum T1 through Tn, where n comes from the digits in outcome_timepoint */
      do i=1 to input(compress(outcome_timepoint,,'kd'),32.);
         PT_Pre=sum(PT_pre,t[i]);
      end;
      /* Sum T10 through Tn; the loop does not run when n is less than 10 */
      do i=10 to input(compress(outcome_timepoint,,'kd'),32.);
         PT_Post=sum(PT_Post,t[i]);
      end;
   end;
   else put "WARNING: Bad outcome value for " ID= outcome=;
run;

And this is the result:

Obs  ID      Outcome  Outcome_timepoint  T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 T13 T14 T15 T16 T17 T18 T19 T20  PT_pre  PT_Post
1    XXXXX1  1        T5                  0  1  1  0  1  0  0  0  0   0   0   0   0   0   0   0   0   0   0   0  3       .
2    XXXXX2  0        NA                  0  1  1  0  1  1  1  0  0   0   0   1   1   0   1   1   1   0   0   0  5       5
3    XXXXX3  1        T13                 0  1  1  0  1  1  1  0  0   0   0   1   1   0   0   0   0   0   0   0  7       2
4    XXXXX4  0        NA                  0  1  1  0  1  1  1  0  0   0   0   1   1   0   1   1   1   0   0   0  5       5

Note that PT_POST for record 1 is missing, because you specified:

"The person-time for patients with outcome=1 in post-stage should sum up the values from T10 until Tn, where Tn is determined by the value of outcome_timepoint"

The timepoint for observation 1 is T5, which is less than T10. What was your desired outcome for this row? Did you intend for this row to add up the values from T5 to T10?

SASJedi

In your report macro, remove AGE from the PROC REPORT BY statement:

proc report data=output nowd;
   by sex;

SASJedi

Cloud Analytic Services (CAS) is only available in SAS Viya. The overview for the GAMMOD Procedure states:

"The GAMMOD procedure fits generalized additive models that are based on low-rank regression splines (Wood 2006) in SAS Viya." (emphasis mine)

If you are using CAS, you are using SAS Viya. You can log into the SAS Viya Compute Server using Enterprise Guide or SAS Studio and run traditional SAS code - the look and feel is just like SAS 9. Is it possible that's your configuration?
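One quick way to see what kind of session you are in is to write the version string to the log and then try to start a CAS session. This is only a sketch: mysess is a made-up session name, and it assumes your site lets you start CAS sessions from the compute server.

```sas
/* Write the full SAS version string to the log */
%put &=SYSVLONG;

/* Try to start a CAS session; "mysess" is a hypothetical session name */
cas mysess;

/* End the CAS session when you are done */
cas mysess terminate;
```

If the session starts, the log should report the CAS server you are connected to, which tells you that you are working in SAS Viya.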

SASJedi

The native language of CAS is CASL, and the unit of work is the CAS action. This blog series by Panagiotis is a great place to start: CAS Action! - a series on fundamentals - SAS Viya Programming

Other resources:
- Exploring SAS Viya: Programming and Data Management (free e-book)
- SAS Viya CAS Libraries (Caslibs) Simplified
- SAS Tutorial | Coding in SAS Viya

I hope this helps to get you started. May the SAS be with you on your quest for knowledge!
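To give a feel for what "the unit of work is the CAS action" looks like, here is a minimal CASL sketch run through PROC CAS. The session name mysess is made up, and the example assumes you are allowed to start a CAS session; the two actions shown only report information and do not change any data.

```sas
/* Start a CAS session; "mysess" is a hypothetical session name */
cas mysess;

proc cas;
   /* Call the serverStatus action from the builtins action set */
   builtins.serverStatus;
   /* Call the caslibInfo action from the table action set to list caslibs */
   table.caslibInfo;
quit;

/* End the CAS session */
cas mysess terminate;
```

Each statement inside PROC CAS names an action set and an action, and CAS sends the results back to your SAS session.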

SASJedi

This is definitely a bug. The easiest workaround is to re-apply the format when you change the label.

proc sql;
   select pt
         ,question label='Y/N' format=yn.
      from ds1
   union corr
   select *
      from ds2;
quit;

If this is a concern for you in your day-to-day SAS use, I recommend submitting it to SAS tech support for further evaluation via the Customer Service Portal - Customer Support (sas.com).

SASJedi

It's not what I would expect. I suspect you've found a bug. Using this code, the problem is reproducible in SAS 9.4 M8 and Viya 2024.04.

proc format;
   value yn 0='No' 1='Yes';
run;

data ds1;
   format pt $2. question yn.;
   label pt ='Patient' question ='Yes or No';
   pt='01'; question=0; output;
   pt='02'; question=1; output;
run;

data ds2;
   format pt $2. question yn.;
   label pt ='Patient' question ='Yes or No';
   pt='03'; question=1; output;
   pt='04'; question=2; output;
run;

proc sql;
   title "DS1";
   select * from ds1;
   title "DS2";
   select * from ds2;
   title "DS1 union DS2";
   select pt, question
      from ds1
   union corr
   select *
      from ds2;
   title "DS1 union DS2 ";
   title2 "Changing the label of the second column removes formatting";
   select pt, question label='Y/N'
      from ds1
   union corr
   select *
      from ds2;
quit;

SASJedi

This code appears to have been generated by PROC IMPORT. I suspect that the input file includes some double-byte characters - it may be UTF-8 encoded instead of WLATIN1. Try changing the FILENAME statement to specify UTF8:

FILENAME REFFILE FILESRVC
   FOLDERPATH='/Projects/crm reporting/Encuestas'
   FILENAME='DV_Resultado_Campaña_ac.txt'
   encoding='UTF8';

Then re-run the code. Did that resolve the issue?
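If it helps with the diagnosis, you can also check which encoding your SAS session itself is using. This is just a sketch using the SYSENCODING automatic macro variable and PROC OPTIONS; the values you see depend on how your site is configured.

```sas
/* Show the encoding of the current SAS session in the log */
%put &=SYSENCODING;

/* Show the ENCODING= system option and its current value */
proc options option=encoding;
run;
```

Knowing the session encoding makes it easier to tell whether the file's encoding differs from it.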

SASJedi · 06-10-2024

Look in the lower right-hand corner of the SAS Studio window. If there are any files available to Recover, you'll see a number next to the Recover icon. If so, you can click the icon to see a list of files you can recover. To recover the files, click "Recover All" then "Apply and Close". I hope this helps.

SASJedi · 06-07-2024

Duplicates in our data can badly skew the results of our analysis. In this post, I’ll cover data deduplication using PROC SORT with the NODUPKEY, OUT=, and DUPOUT= options. We will also look at using PROC SQL and PROC FedSQL for deduplication. It’s possible to deduplicate data that has messy text using PROC SORT if you have numeric columns that can be combined to make a row unique. Otherwise, the text data must be standardized so that identical values can be detected.

First, a peek at the data. We’ll be working with the crime table. In this image, the rows we need to remove as duplicates are highlighted in red. The reason they should be discarded is noted in the Reason column. Of course, the actual crime table does not include the Disposition and Reason columns, or this would be a really easy task 😁

Row  Date_Rptd  DATE_OCC   TIME_OCC  AREA_NAME    Rpt_Dist_No  Vict_Age  Vict_Sex  Vict_Descent  Status_Desc
1    2/6/2024   1/21/2024  0:16:40   N Hollywood  1503         31        F         W             Adult Arrest
2    2/1/2024   1/30/2024  0:27:25   Hollywood    622          52        F         W             Adult Arrest
3    2/1/2024   1/30/2024  0:27:25   HOLLYWOOD    622          52        F         W             ADULT ARREST
4    2/2/2024   2/2/2024   0:28:52   Wilshire     724          0         X         X             Adult Arrest
5    2/2/2024   2/2/2024   0:28:52   WILSHIRE     724          0         X         X             ADULT ARREST
6    2/4/2024   2/3/2024   0:32:15   Van Nuys     926          0         X         X             Adult Arrest
7    2/4/2024   2/3/2024   0:32:15   Van Nuys     926          0         x         x             ARREST
8    2/4/2024   2/4/2024   0:18:20   Mission      1908         37        F         H             Adult Arrest
9    2/4/2024   2/4/2024   0:18:20   MISSION      1908         37        F         H             ADULT ARREST
10   2/4/2024   2/4/2024   0:26:40   Harbor       587          39        F         W             Adult Arrest
11   2/4/2024   2/4/2024   0:26:40   HARBOR       587          39        F         W             ADULT ARREST
12   2/5/2024   2/4/2024   0:28:40   Northeast    1152         33        F         B             Juv Arrest
13   2/4/2024   2/4/2024   0:30:45   Hollywood    645          25        F         B             Adult Arrest
14   2/4/2024   2/4/2024   0:30:45   Hollywd      645          25        f         b             ARREST
15   2/4/2024   2/4/2024   0:34:15   Harbor       504          0         F         H             Adult Arrest
16   2/4/2024   2/4/2024   0:35:20   Olympic      2055         22        F         H             Adult Arrest
17   2/4/2024   2/4/2024   0:35:20   OLYMPIC      2055         22        F         H             ADULT ARREST
18   2/6/2024   2/6/2024   0:14:00   Wilshire     787          51        M         W             Adult Arrest
19   2/6/2024   2/6/2024   0:30:15   Harbor       567          45        F         H             Adult Arrest
20   2/6/2024   2/6/2024   0:30:15   HARBOR       567          45        F         H             ADULT ARREST
21   2/7/2024   2/7/2024   0:00:15   West LA      853          55        F         O             Adult Arrest
22   2/7/2024   2/7/2024   0:00:15   West LA      853          55        f         o             ARREST
23   2/7/2024   2/7/2024   0:13:45   Newton       1309         0         M         A             Adult Arrest
24   2/7/2024   2/7/2024   0:13:45   Newton       1309         0         M         A             Adult Arest
25   2/7/2024   2/7/2024   0:23:20   Southwest    393          0         X         X             Adult Arrest
26   2/7/2024   2/7/2024   0:23:20   Southwest    393          0         x         x             ARREST

Rows 2 and 3, 4 and 5, 8 and 9, 10 and 11, 16 and 17, and 19 and 20 were probably duplicated due to inconsistent text value casing. Row 7 has a truncated text value for Status_Desc, but is otherwise identical to row 6. Row 24 has a misspelling in the value for Status_Desc, but is otherwise identical to row 23. Our goal is to deduplicate this table, producing this final result:

Row  Date_Rptd  DATE_OCC   TIME_OCC  AREA_NAME    Rpt_Dist_No  Vict_Age  Vict_Sex  Vict_Descent  Status_Desc
1    2/7/2024   2/7/2024   0:00:15   West LA      853          55        F         O             Adult Arrest
2    2/7/2024   2/7/2024   0:13:45   Newton       1309         0         M         A             Adult Arrest
3    2/6/2024   2/6/2024   0:14:00   Wilshire     787          51        M         W             Adult Arrest
4    2/6/2024   1/21/2024  0:16:40   N Hollywood  1503         31        F         W             Adult Arrest
5    2/4/2024   2/4/2024   0:18:20   Mission      1908         37        F         H             Adult Arrest
6    2/7/2024   2/7/2024   0:23:20   Southwest    393          0         X         X             Adult Arrest
7    2/4/2024   2/4/2024   0:26:40   Harbor       587          39        F         W             Adult Arrest
8    2/1/2024   1/30/2024  0:27:25   Hollywood    622          52        F         W             Adult Arrest
9    2/5/2024   2/4/2024   0:28:40   Northeast    1152         33        F         B             Juv Arrest
10   2/2/2024   2/2/2024   0:28:52   Wilshire     724          0         X         X             Adult Arrest
11   2/6/2024   2/6/2024   0:30:15   Harbor       567          45        F         H             Adult Arrest
12   2/4/2024   2/4/2024   0:30:45   Hollywood    645          25        F         B             Adult Arrest
13   2/4/2024   2/3/2024   0:32:15   Van Nuys     926          0         X         X             Adult Arrest
14   2/4/2024   2/4/2024   0:34:15   Harbor       504          0         F         H             Adult Arrest
15   2/4/2024   2/4/2024   0:35:20   Olympic      2055         22        F         H             Adult Arrest

As I look over the data, I can see that I’d need to standardize the text fields before I could use them for deduplication.
But this table has a lot of numeric columns, and I’m willing to bet that the combination of Date_Rptd, DATE_OCC, TIME_OCC, Rpt_Dist_No, and Vict_Age will be unique for each row. In my previous post, we discussed sorting using PROC SORT – but that procedure can do so much more! I can use PROC SORT with the NODUPKEY option to deduplicate this data if I use all of these numeric variables in my BY statement. I’ll want to see the rows that got rejected, just so I can verify that my scheme is working. I can use the DUPOUT= option for that. The PROC SORT code looks like this:

proc sort data=crime nodupkey
          out=nodups
          dupout=dups;
   by _numeric_;
run;

_NUMERIC_ is a very handy special SAS name variable list that specifies all numeric variables in a data set without having to list them individually. The results are just what I had hoped:

Row  Date_Rptd  DATE_OCC   TIME_OCC  AREA_NAME    Rpt_Dist_No  Vict_Age  Vict_Sex  Vict_Descent  Status_Desc
1    2/7/2024   2/7/2024   0:00:15   West LA      853          55        F         O             Adult Arrest
2    2/7/2024   2/7/2024   0:13:45   Newton       1309         0         M         A             Adult Arrest
3    2/6/2024   2/6/2024   0:14:00   Wilshire     787          51        M         W             Adult Arrest
4    2/6/2024   1/21/2024  0:16:40   N Hollywood  1503         31        F         W             Adult Arrest
5    2/4/2024   2/4/2024   0:18:20   Mission      1908         37        F         H             Adult Arrest
6    2/7/2024   2/7/2024   0:23:20   Southwest    393          0         X         X             Adult Arrest
7    2/4/2024   2/4/2024   0:26:40   Harbor       587          39        F         W             Adult Arrest
8    2/1/2024   1/30/2024  0:27:25   Hollywood    622          52        F         W             Adult Arrest
9    2/5/2024   2/4/2024   0:28:40   Northeast    1152         33        F         B             Juv Arrest
10   2/2/2024   2/2/2024   0:28:52   Wilshire     724          0         X         X             Adult Arrest
11   2/6/2024   2/6/2024   0:30:15   Harbor       567          45        F         H             Adult Arrest
12   2/4/2024   2/4/2024   0:30:45   Hollywood    645          25        F         B             Adult Arrest
13   2/4/2024   2/3/2024   0:32:15   Van Nuys     926          0         X         X             Adult Arrest
14   2/4/2024   2/4/2024   0:34:15   Harbor       504          0         F         H             Adult Arrest
15   2/4/2024   2/4/2024   0:35:20   Olympic      2055         22        F         H             Adult Arrest

And the rejected rows are as expected:

Row  Date_Rptd  DATE_OCC   TIME_OCC  AREA_NAME    Rpt_Dist_No  Vict_Age  Vict_Sex  Vict_Descent  Status_Desc
1    2/7/2024   2/7/2024   0:00:15   West LA      853          55        f         o             ARREST
2    2/7/2024   2/7/2024   0:13:45   Newton       1309         0         M         A             Adult Arest
3    2/4/2024   2/4/2024   0:18:20   MISSION      1908         37        F         H             ADULT ARREST
4    2/7/2024   2/7/2024   0:23:20   Southwest    393          0         x         x             ARREST
5    2/4/2024   2/4/2024   0:26:40   HARBOR       587          39        F         W             ADULT ARREST
6    2/1/2024   1/30/2024  0:27:25   HOLLYWOOD    622          52        F         W             ADULT ARREST
7    2/2/2024   2/2/2024   0:28:52   WILSHIRE     724          0         X         X             ADULT ARREST
8    2/6/2024   2/6/2024   0:30:15   HARBOR       567          45        F         H             ADULT ARREST
9    2/4/2024   2/4/2024   0:30:45   Hollywd      645          25        f         b             ARREST
10   2/4/2024   2/3/2024   0:32:15   Van Nuys     926          0         x         x             ARREST
11   2/4/2024   2/4/2024   0:35:20   OLYMPIC      2055         22        F         H             ADULT ARREST

But what if there were not sufficient numeric columns to create a distinct identity for each row? For example, table work.crime2:

Row  Date_Rptd  AREA_NAME    Vict_Sex  Vict_Descent  Status_Desc
1    2/1/2024   HOLLYWOOD    F         W             ADULT ARREST
2    2/1/2024   Hollywood    F         W             Adult Arrest
3    2/2/2024   WILSHIRE     X         X             ADULT ARREST
4    2/2/2024   Wilshire     X         X             Adult Arrest
5    2/4/2024   HARBOR       F         W             ADULT ARREST
6    2/4/2024   Harbor       F         W             Adult Arrest
7    2/4/2024   Harbor       F         H             Adult Arrest
8    2/4/2024   Hollywd      f         b             ARREST
9    2/4/2024   Hollywood    F         B             Adult Arrest
10   2/4/2024   MISSION      F         H             ADULT ARREST
11   2/4/2024   Mission      F         H             Adult Arrest
12   2/4/2024   OLYMPIC      F         H             ADULT ARREST
13   2/4/2024   Olympic      F         H             Adult Arrest
14   2/4/2024   Van Nuys     X         X             Adult Arrest
15   2/4/2024   Van Nuys     x         x             ARREST
16   2/5/2024   Northeast    F         B             Juv Arrest
17   2/6/2024   HARBOR       F         H             ADULT ARREST
18   2/6/2024   Harbor       F         H             Adult Arrest
19   2/6/2024   N Hollywood  F         W             Adult Arrest
20   2/6/2024   Wilshire     M         W             Adult Arrest
21   2/7/2024   Newton       M         A             Adult Arrest
22   2/7/2024   Newton       M         A             Adult Arest
23   2/7/2024   Southwest    X         X             Adult Arrest
24   2/7/2024   Southwest    x         x             ARREST
25   2/7/2024   West LA      F         O             Adult Arrest
26   2/7/2024   West LA      f         o             ARREST

The character columns still contain irregularities, but now the only numeric column is Date_Rptd – and the values there are not unique. Using my previous trick of sorting BY _NUMERIC_, many rows are erroneously discarded as duplicates.
Instead of the expected 15 rows, the nodups data set includes only 6:

Row  Date_Rptd  AREA_NAME    Vict_Sex  Vict_Descent  Status_Desc
1    2/1/2024   HOLLYWOOD    F         W             ADULT ARREST
2    2/2/2024   WILSHIRE     X         X             ADULT ARREST
3    2/4/2024   HARBOR       F         W             ADULT ARREST
4    2/5/2024   Northeast    F         B             Juv Arrest
5    2/6/2024   HARBOR       F         H             ADULT ARREST
6    2/7/2024   Southwest    x         x             ARREST

To ensure I remove only rows that are complete duplicates of another, I’ll have to sort on the contents of all columns. SAS provides that handy _ALL_ special SAS name variable list that specifies all of the variables in a data set, saving me from having to type them all individually. But the data in those character columns is messy. I’ll need to standardize it before relying on it for de-duplication.

In one of my previous posts, Coding for Data Quality in SAS Viya Part 2 – Standardization, I showcased some powerful tools available to SAS Viya programmers (and SAS 9 programmers who license SAS Data Quality) for standardizing data. In this case, I’ll use only Base SAS functionality in the DATA step. This will work fine for the simple issues found in this small data set.
Note that the DATA step uses another handy variable list, this time the name prefix list:

data crime_std;
   set crime2;
   array charvars[*] _character_;
   /* Drop all variables with names starting with an underscore */
   drop _:;
   /* Make all character values proper case */
   do _i=1 to dim(charvars);
      charvars[_i]=propcase(charvars[_i]);
   end;
   /* Correct the known spelling errors */
   Status_Desc=tranwrd(Status_Desc,'Arest','Arrest');
   AREA_NAME=tranwrd(AREA_NAME,'Hollywd','Hollywood');
   /* Standardize the truncated values for Adult Arrest */
   if Status_Desc='Arrest' then Status_Desc='Adult Arrest';
run;

With the character data standardized, deduplication will be easy:

Row  Date_Rptd  AREA_NAME    Vict_Sex  Vict_Descent  Status_Desc
1    2/1/2024   Hollywood    F         W             Adult Arrest
2    2/1/2024   Hollywood    F         W             Adult Arrest
3    2/2/2024   Wilshire     X         X             Adult Arrest
4    2/2/2024   Wilshire     X         X             Adult Arrest
5    2/4/2024   Harbor       F         W             Adult Arrest
6    2/4/2024   Harbor       F         H             Adult Arrest
7    2/4/2024   Harbor       F         W             Adult Arrest
8    2/4/2024   Hollywood    F         B             Adult Arrest
9    2/4/2024   Hollywood    F         B             Adult Arrest
10   2/4/2024   Mission      F         H             Adult Arrest
11   2/4/2024   Mission      F         H             Adult Arrest
12   2/4/2024   Olympic      F         H             Adult Arrest
13   2/4/2024   Olympic      F         H             Adult Arrest
14   2/4/2024   Van Nuys     X         X             Adult Arrest
15   2/4/2024   Van Nuys     X         X             Adult Arrest
16   2/5/2024   Northeast    F         B             Juv Arrest
17   2/6/2024   Harbor       F         H             Adult Arrest
18   2/6/2024   Harbor       F         H             Adult Arrest
19   2/6/2024   N Hollywood  F         W             Adult Arrest
20   2/6/2024   Wilshire     M         W             Adult Arrest
21   2/7/2024   Newton       M         A             Adult Arrest
22   2/7/2024   Newton       M         A             Adult Arrest
23   2/7/2024   Southwest    X         X             Adult Arrest
24   2/7/2024   Southwest    X         X             Adult Arrest
25   2/7/2024   West La      F         O             Adult Arrest
26   2/7/2024   West La      F         O             Adult Arrest

I’ll use PROC SORT NODUPKEY with a BY _ALL_ statement:

proc sort data=crime_std
          out=nodups
          dupout=dups
          nodupkey;
   by _all_;
run;

This produces the desired result and, as a bonus, the resulting data is much more presentable:

Row  Date_Rptd  AREA_NAME    Vict_Sex  Vict_Descent  Status_Desc
1    2/1/2024   Hollywood    F         W             Adult Arrest
2    2/2/2024   Wilshire     X         X             Adult Arrest
3    2/4/2024   Harbor       F         H             Adult Arrest
4    2/4/2024   Harbor       F         W             Adult Arrest
5    2/4/2024   Hollywood    F         B             Adult Arrest
6    2/4/2024   Mission      F         H             Adult Arrest
7    2/4/2024   Olympic      F         H             Adult Arrest
8    2/4/2024   Van Nuys     X         X             Adult Arrest
9    2/5/2024   Northeast    F         B             Juv Arrest
10   2/6/2024   Harbor       F         H             Adult Arrest
11   2/6/2024   N Hollywood  F         W             Adult Arrest
12   2/6/2024   Wilshire     M         W             Adult Arrest
13   2/7/2024   Newton       M         A             Adult Arrest
14   2/7/2024   Southwest    X         X             Adult Arrest
15   2/7/2024   West La      F         O             Adult Arrest

With the data pre-standardized, I can use either PROC SQL or PROC FedSQL to do my deduplicating, if desired. Both of these steps will produce results identical to the PROC SORT output:

proc sql;
   /* Create the de-duplicated table */
   create table nodups_sql as
      select distinct *
         from crime_std
   ;
quit;

proc FedSQL;
   /* First, make sure the table does not exist.
      FedSQL does not overwrite existing tables */
   drop table nodups_FedSQL force;
   /* Create the de-duplicated table */
   create table nodups_FedSQL as
      select distinct *
         from crime_std
   ;
quit;

So today we covered some of the challenges and pitfalls when deduplicating data and discovered how special variable lists _NUMERIC_, _ALL_, and prefix: can make short work of long lists of variables when writing SAS programs. Do you have any favorite tips for de-duplicating data? What other special variable lists have you found useful in your SAS code?

Until next time, may the SAS be with you!
Mark

Grab the ZIP file containing the code and data for this blog series from my GitHub at https://github.com/SASJedi/blogPackages/raw/main/data_manipulation_in_base_sas.zip

Links to prior posts in this series:
- Part 1 – Append
- Part 2 – Sort
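One more check that can be handy before trusting a NODUPKEY sort or a SELECT DISTINCT is to list which rows actually occur more than once. The query below is only a sketch along those lines. It assumes the standardized crime_std table from this post with its five columns, and the Copies column name is made up for the example.

```sas
proc sql;
   /* List each distinct row that occurs more than once, with its count */
   select Date_Rptd, AREA_NAME, Vict_Sex, Vict_Descent, Status_Desc,
          count(*) as Copies
      from crime_std
      group by Date_Rptd, AREA_NAME, Vict_Sex, Vict_Descent, Status_Desc
      having count(*) > 1;
quit;
```

Running the same query against the deduplicated table should return no rows, which is a quick way to confirm that the deduplication did its job.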
