Hi SAS Experts,
I need to read in a CSV file that contains Polish characters. Below is an example of one file, and I'm attaching a few more as examples. The Polish characters aren't retained properly when I use PROC IMPORT to read in the data. Because I will have multiple CSV files, this is what I am trying to do:
(1) Read in each csv file, which is data for one day
(2) Keep only data for the first two variables: "wojewodztwo", "liczba_przypadkow" and drop the rest of the variables
(3) Delete data from the first row
(4) Create a new date variable in MMDDYY10. format, e.g. 12/10/2020
(5) Rename "wojewodztwo", "liczba_przypadkow" into new variable names
(6) Repeat the above steps with multiple csv files with identical or very similar variable names, each csv file would contain data from different dates
(7) Merge all files
(8) Export a comma delimited csv file for use by a different programmer in R
Thanks so much in advance for your help!
wojewodztwo;liczba_przypadkow;liczba_na_10_tys_mieszkancow;zgony;zgony_w_wyniku_covid_bez_chorob_wspolistniejacych;zgony_w_wyniku_covid_i_chorob_wspolistniejacych;liczba_zlecen_poz;liczba_ozdrowiencow;liczba_osob_objetych_kwarantanna;liczba_wykonanych_testow;liczba_testow_z_wynikiem_pozytywnym;liczba_testow_z_wynikiem_negatywnym;liczba_pozostalych_testow;teryt
Cały kraj;7152;1.86;419;81;338;10947;6423;190953;47141;7918;38419;804;t00
dolnośląskie;567;1.96;11;10;1;672;321;13251;3701;631;2998;72;t02
kujawsko-pomorskie;729;3.52;44;4;40;922;586;13724;3110;790;2277;43;t04
lubelskie;323;1.54;24;4;20;545;392;8274;2125;365;1727;33;t06
lubuskie;217;2.15;23;8;15;355;169;5384;1198;234;948;16;t08
łódzkie;410;1.67;44;4;40;669;452;12036;3502;474;2950;78;t10
małopolskie;315;0.92;13;8;5;530;258;9588;2696;343;2311;42;t12
mazowieckie;874;1.61;63;7;56;1760;851;36493;6362;986;5231;145;t14
opolskie;143;1.46;7;0;7;181;145;4463;1219;174;1022;23;t16
podkarpackie;216;1.02;15;0;15;281;136;5725;1815;238;1569;8;t18
podlaskie;226;1.92;3;0;3;346;155;5427;981;242;729;10;t20
pomorskie;740;3.15;31;5;26;934;626;16756;3346;785;2493;68;t22
śląskie;489;1.08;27;2;25;973;508;14718;4528;577;3899;52;t24
świętokrzyskie;147;1.20;26;3;23;175;121;2869;1245;174;1037;34;t26
warmińsko-mazurskie;415;2.92;26;8;18;690;568;10471;1905;443;1443;19;t28
wielkopolskie;740;2.11;36;17;19;1126;604;19211;3858;790;3014;54;t30
zachodniopomorskie;469;2.77;26;1;25;780;464;12538;2546;522;1940;84;t32
Accepted Solutions
Did you try reading the file using WLATIN2 encoding?
%let path=C:\Downloads;
%let fname=20210120054532_rap_rcb_woj_eksport.csv;
data test1;
infile "&path\&fname" encoding='wlatin2' dsd firstobs=2 dlm=';';
input var1 :$30.;
run;
proc freq; tables var1; run;
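For reference, here is a minimal sketch that extends the same idea to the first two columns. It assumes the file is WLATIN2 (Windows-1250) encoded and semicolon-delimited, as in the sample above; the dataset and variable names (cases, region, n_cases) are made up for illustration.

```sas
%let path=C:\Downloads;
%let fname=20210120054532_rap_rcb_woj_eksport.csv;

data work.cases;
  /* ENCODING= tells SAS how the bytes in the file are encoded;
     SAS transcodes them to the session encoding while reading. */
  infile "&path\&fname" encoding='wlatin2' dsd dlm=';' firstobs=2 truncover;
  /* Leave room: in a UTF-8 session each Polish letter takes 2 bytes. */
  length region $60 n_cases 8;
  /* List input reads only the first two fields; the rest of each line is ignored. */
  input region n_cases;
run;

proc print data=work.cases;
run;
```

FIRSTOBS=2 skips only the header line. If the national total row should also go, it would need its own filter.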
1) Try to change ENCODING system option to UTF-8
2) Copy the PROC IMPORT generated code from the log and adapt it to your needs
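If you want to stay with PROC IMPORT for a first look, one option is to declare the file's encoding on a fileref and hand that fileref to the procedure. This is only a sketch: the fileref name (csvin) and output dataset (work.raw) are made up, and the encoding is assumed to be WLATIN2, which is what turned out to work elsewhere in this thread.

```sas
/* Declare the external file's encoding on the fileref (assumed: WLATIN2). */
filename csvin "C:\Downloads\20210120054532_rap_rcb_woj_eksport.csv"
         encoding='wlatin2';

proc import datafile=csvin out=work.raw dbms=dlm replace;
  delimiter=';';        /* the sample file is semicolon-delimited */
  getnames=yes;         /* take variable names from the header line */
  guessingrows=max;     /* scan all rows before guessing types and lengths */
run;
```

The DATA step that PROC IMPORT writes to the log can then be copied and trimmed down to the two variables you need.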
Thanks. How do I change the ENCODING system option to UTF-8? I mainly use SAS 9.4 (English), should I switch to SAS 9.4 (Unicode Support)?
@RLC07 wrote:
Thanks. How do I change the ENCODING system option to UTF-8? I mainly use SAS 9.4 (English), should I switch to SAS 9.4 (Unicode Support)?
Yes.
ENCODING is a system option that can be defined in the configuration file or overridden when invoking SAS.
In the .cfg file, search for "encoding=". The current value is probably WLATIN1. Update it to UTF-8.
If that does not help, look for the 'exec sas <options>' command and update it there.
A few days ago somebody changed EN to U8 in the exec command, and that helped him.
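Before editing anything, it may help to check what the running session actually uses. A small sketch using standard options and the SYSENCODING automatic macro variable:

```sas
/* Which configuration file(s) did this session start with? */
proc options option=config;
run;

/* What encoding is the session actually using? */
proc options option=encoding;
run;
%put NOTE: Session encoding is &sysencoding;

/* ENCODING= cannot be changed with an OPTIONS statement in a running
   session. It has to be set at startup, e.g. with a line such as
       -ENCODING UTF-8
   in the configuration file or on the command line. */
```

On a typical Windows install, the "SAS 9.4 (Unicode Support)" shortcut appears to do this by pointing at a different configuration file, so starting from that shortcut is usually simpler than editing the English configuration by hand.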
- Use a data step.
- Use the KEEP statement.
- Unsure: if you want to ignore the first row, use firstobs=2 in the INFILE statement.
- What should the new variable be created from? A SAS date is the number of days since 01JAN1960; a format makes it human-readable.
- The RENAME statement can do this, but why create the variables with "wrong" names in the first place? That does not make sense.
- If the files are identical, use a wildcard in the INFILE statement; if not, you will have to write a data step for each type.
- Use another data step with MERGE and BY; all datasets have to be sorted by the variables used in the BY statement.
- PROC EXPORT can do this. Note that you have to use dbms=dlm and specify the delimiter; with dbms=csv the comma is used.
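The steps above can be sketched roughly as follows. This is not a definitive implementation. It assumes all daily files sit in one folder, share the layout shown in the question, are WLATIN2-encoded, and carry a yyyymmdd stamp at the start of the file name (as in 20210120054532_rap_rcb_woj_eksport.csv). The folder paths and the names region, n_cases and report_date are hypothetical. Because the files are stacked by the wildcard, no separate MERGE step is needed here.

```sas
/* Hypothetical locations; adjust to your setup. */
%let indir   = C:\Poland_raw;
%let outpath = C:\SAS\Poland\poland_cases.csv;

data work.all_days;
  length fname $260 region $60 n_cases 8 report_date 8;
  /* The wildcard reads every matching file in turn; FILENAME= exposes the
     name of the file currently being read (fname is not written out). */
  infile "&indir\*.csv" encoding='wlatin2' dsd dlm=';' truncover
         filename=fname;
  input @;                              /* load the next record and hold it */
  if fname ne lag(fname) then delete;   /* first record of each file = header */
  input region n_cases;                 /* only the first two fields are read */
  /* Assumption: the file name starts with a yyyymmdd date stamp. */
  report_date = input(scan(fname, -1, '\/'), yymmdd8.);
  format report_date mmddyy10.;
run;

/* Write a comma-delimited file; the fileref sets the output encoding. */
filename outcsv "&outpath" encoding='utf-8';
proc export data=work.all_days outfile=outcsv dbms=csv replace;
run;
```

The output folder is deliberately different from the input folder, so a rerun does not pick up the exported file through the wildcard. Reading the variables under their final names avoids a RENAME step altogether.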
Thanks. The files aren't mine, so they come with variable names that I do not want, along with a first row of data that is of no use to us. I used PROC SORT, PROC IMPORT and PROC EXPORT to read in the CSV files and to export the merged, cleaned file as a CSV file, but the Polish characters were messed up. The result couldn't easily be merged with the larger file for the R programmer unless he manually corrected the Polish names beforehand. I can easily manipulate the files if the names are in English. I want to find a way to read in all these CSV files with Polish names, clean and manipulate the files, merge them, and then export a CSV with the Polish names retained and the encoding intact.
I wouldn't use PROC IMPORT if you can avoid it. It has no information about how to define the variables (other than the column headers, which help it generate variable names). It has to guess based on what it sees in the file, which might be only a subset of the universe of possible values for those variables.
You could use the original column headers as labels and then when you generate the new CSV file write the labels as the column headers instead of the variable names.
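A small sketch of the label idea, using made-up names (work.demo, region, n_cases) and an assumed output path. The LABEL option of PROC EXPORT writes the variable labels, rather than the variable names, as the header row.

```sas
data work.demo;
  length region $60 n_cases 8;
  infile datalines dsd dlm=',' truncover;
  input region n_cases;
  /* Keep the original column headers as labels. */
  label region  = 'wojewodztwo'
        n_cases = 'liczba_przypadkow';
datalines;
mazowieckie,874
opolskie,143
;

/* Hypothetical output location. */
filename lblcsv "C:\SAS\Poland\demo_labels.csv" encoding='utf-8';

proc export data=work.demo outfile=lblcsv dbms=csv label replace;
run;
```

This way the working names inside SAS can be anything convenient, while the exported file still carries the headers the other programmer expects.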
Thank you! I used proc import initially but have switched over to using 'infile'. Unfortunately, I am unable to read in the file.
libname Poland 'C:\SAS\Poland';
filename extfile 'C:\Poland_raw\casefile.csv' encoding="utf-8";
data poland.casedata;
infile extfile;
input wojewodztwo $ liczba_przypadkow;
run;
I adapted the code above from the SAS user guide and received this error:
ERROR: Invalid string.
FATAL: Unrecoverable I/O error detected in the execution of the DATA step program.
Aborted during the EXECUTION phase.
NOTE: 3 records were read from the infile EXTFILE.
I am now using SAS 9.4 (Unicode Support) rather than SAS 9.4 (English). I would love to simply read in all of the observations for the first two variables of the raw dataset without messing up the Polish characters, before I manipulate the data in SAS and export the file back out to a CSV with a comma as the delimiter.
Use this to confirm your current session encoding:
proc options
option = encoding;
run;
Are you running SAS locally on your PC or on a remote SAS server? If you are using a remote SAS server then only a SAS administrator will be able to change session encoding for you.
I am running SAS locally. Using the code you shared, I was able to confirm that the encoding is UTF-8. Thanks!
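@RLC07 - To correctly read your CSV I think you will need to add delimiter options. DLM= and DSD are INFILE statement options, so they belong on the INFILE statement rather than on FILENAME, and your sample file uses a semicolon as the delimiter:
filename extfile 'C:\Poland_raw\casefile.csv' encoding="utf-8";
infile extfile dlm=';' dsd firstobs=2;
You may need to modify your INPUT statement too.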
Thanks everyone for your suggestions!
Using SAS 9.4 (Unicode Support), I was able to read in one of the raw CSV files in Polish. However, the Polish characters are still not showing up correctly. Any ideas?
Below is my code:
libname Poland 'C:\SAS\Poland';
filename extfile 'C:\SAS\Poland_raw\casefile.csv' encoding="utf-8";
data a;
length loc_name $19.;
infile 'C:\SAS\Poland_raw\casefile.csv' firstobs=2 DLM=';' dsd;
input loc_name $ infected;
run;
This is what I see in my dataset:
Should I use PROC CIMPORT? I also read in one of the many SAS user guides about switching the locale, i.e. sas9 -locale pl_PL
Is using this appropriate? And if so, where would I put this syntax within my data steps? Thanks all!
@Tom Switching from UTF-8 encoding to WLATIN2 encoding worked - thank you so much for your insights and help! The Polish names in my export file matched those in another file that was read in and cleaned by a colleague who used R. I will know next week whether he can successfully take the larger file that I will create and merge it with his file of historic case data. When I encounter another raw file with non-English variables in the future, would you recommend that I first try UTF-8 encoding and switch to a different encoding only if it doesn't work, or simply read in the data file using the encoding that best supports that particular language? Thank you again!
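One way to approach that question is to look at the first few records under each candidate encoding and compare the log output. The macro below is only a sketch; its name (peek) and parameters are invented for illustration.

```sas
/* Hypothetical helper: print the first 5 records of a file to the log,
   read under a given encoding. */
%macro peek(file, enc);
  data _null_;
    infile "&file" encoding="&enc" obs=5 truncover;
    input line $char200.;
    put "&enc: " line;
  run;
%mend peek;

%peek(C:\Downloads\20210120054532_rap_rcb_woj_eksport.csv, wlatin2)
%peek(C:\Downloads\20210120054532_rap_rcb_woj_eksport.csv, utf-8)
```

If the file is not really UTF-8, the second call is likely to stop with the same "Invalid string" error shown earlier in this thread, which is itself a useful hint that a different encoding is needed.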