Hi Experts,
I have a 2-part question and could use your help in reading in reading in another csv file. This time my trouble is reading in the date variable in with the proper formatting while using 'infile', instead of 'proc import', because of the Polish characters.
The csv file that I need to read in has 3 variables. The dates are shown as mm/dd/yyyy (eg. 01/28/2021) but as you can see below, all the dates are missing.
%let path=C:\Users\SAS\Poland;
%let fname=Poland_cases.csv;
options DATESTYLE=mdy;
data master2;
length loc_name $20.;
format date mmddyy10.;
infile "&path\&fname" encoding='utf-8' dsd firstobs=2 dlm=',';
input loc_name $ infected
@10 date anydtdte12.;
run;
For this file, I have to use encoding ='utf-8' or instead of encoding ='wlatin2' or else the Polish characters wouldn't read in properly. For reading in the date, how should I change my code? I've tried a few different things already.
And the second part of my question is, will it be problematic for me to merge this file (encoding = 'utf-8') with another weekly file that was created using the daily raw data files that were read in using encoding='wlatin2'? Thanks very much!
Please post a few lines of the file by copy/pasting into a window opened with the </> button. Open the file with a text editor (not with Excel!) to do this.
loc_name,infected,date Dolnośląskie,1035,11/24/2020 Dolnośląskie,825,11/25/2020 Dolnośląskie,1112,11/26/2020 Dolnośląskie,1318,11/27/2020
You do not want to use pointer movement controls ( the @ in your code) or fixed format mode with a delimited file. Just let the INPUT statement parse the line based on the delimiter. Use the : (colon) modifier in front of any in-line informats included in your INPUT statement so that values are read in list mode.
Also I wouldn't use the ANYDTDTE informat, it will create different dates for the some input string based on the LOCALE setting. Better to use the informat that actually matches the order that month , day and year fields are expected to appear in the data. Plus if you use year month day order when displaying the dates then you won't confuse half your audience when the DAY number is 12 or less.
data want ;
infile cards dsd truncover ;
input loc_name :$20. infected date :mmddyy.;
format date yymmdd10.;
cards;
Dolnośląskie,1035,11/24/2020
Dolnośląskie,825,11/25/2020
Dolnośląskie,1112,11/26/2020
Dolnośląskie,1318,11/27/2020
;
proc print;
run;
Results:
Thanks for the detailed explanation. I included '@' after finding an example code that also suggested I used the ANYDTDTE informat in order to read in dates from a csv in any format. Obviously, it didn't work. Thanks for providing me with the codes - it's been many years since I have used the card statement, but this read in my file accurately. Thanks again!
For the read part, this should do it:
%let path=C:\Users\SAS\Poland;
%let fname=Poland_cases.csv;
data master2;
length loc_name $20.;
format date mmddyy10.;
infile "&path\&fname" encoding='utf-8' dsd firstobs=2 dlm=',';
input
loc_name
infected
date :mmddyy10.
;
run;
For the merge, get to know your data better. For instance, display the loc_name with a $hex format (length double that of the variable), so you can see the real content, and if you need to do a translation.
Also do a search for "sas convert character from latin to utf", you will find lots of helpful information.
Thanks for the code; it read in the file properly. Thank you and thanks also for search term for finding info to the second part of my question.
If you are going to use mixed encodings then you will want to run your session using UTF-8 encoding. Otherwise it will be impossible to map all of the different characters represented in a different single byte encoding to the single byte encoding that your session is using.
But as long as the source datasets are properly created with the right metadata about what encoding the characters are using you should be able to combine them in your UTF-8 session. One thing to watch out for is whether the strings are defined long enough to store the UTF-8 codes. In a single byte encoding (like WLATIN1 or WLATIN2) each character takes one byte of storage. But when translating to UTF-8 codes some characters take up to 4 bytes.
@Tom Thanks for your advice! I ran the session using UTF-8 encoding and merging the two files worked like a charm. What happens if I were to run my session without using SAS 9.4 Unicode Support (e.g. SAS 9.4 English) but specify the encoding (UTF-8 or WLATIN2) when each file is read in and try to merge them? Assuming SAS 9.4 English with encoding specified can support reading in the characters correctly, will I have to convert the encoding for one of the files and make sure they are the same before the merge?
It depends on the actual characters in the file. If the file is using WLATIN2 and all of the characters that actually appear in the file can be mapped into your current SAS sessions encoding (let's assume that it is using WLATIN1) then you will have no problems. But as soon as one of the input text files has something that is not one of the 256 possible characters supported by WLATIN1 then you will get a null value. Now it the other file you are matching has the same character (prehaps represented in a different encoding) then it will also be mapped to the same strange thing. So the strings will match, but you will get some false positive matches where two string are different in just one character but neither character from the two sources can map into your current encoding.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.