SAS Programming

RLC07 · Posted 01-22-2021 12:42 AM

Hi SAS Experts,
I need to read in a CSV file that has Polish characters. Below is an example of one file, and I'm attaching a few more as examples. The Polish characters aren't retained properly when I use 'proc import' to read in the data. Because I will have multiple CSV files, I am trying to do the following:

(1) Read in each csv file, which is data for one day
(2) Keep only data for the first two variables: "wojewodztwo", "liczba_przypadkow" and drop the rest of the variables

(3) Delete data from the first row
(4) Create a new variable for the date in mmddyy10. format, for example 12/10/2020
(5) Rename "wojewodztwo", "liczba_przypadkow" into new variable names
(6) Repeat the above steps with multiple csv files with identical or very similar variable names, each csv file would contain data from different dates
(7) Merge all files
(8) Export a comma delimited csv file for use by a different programmer in R

Thanks so much in advance for your help!

wojewodztwo;liczba_przypadkow;liczba_na_10_tys_mieszkancow;zgony;zgony_w_wyniku_covid_bez_chorob_wspolistniejacych;zgony_w_wyniku_covid_i_chorob_wspolistniejacych;liczba_zlecen_poz;liczba_ozdrowiencow;liczba_osob_objetych_kwarantanna;liczba_wykonanych_testow;liczba_testow_z_wynikiem_pozytywnym;liczba_testow_z_wynikiem_negatywnym;liczba_pozostalych_testow;teryt
Cały kraj;7152;1.86;419;81;338;10947;6423;190953;47141;7918;38419;804;t00
dolnośląskie;567;1.96;11;10;1;672;321;13251;3701;631;2998;72;t02
kujawsko-pomorskie;729;3.52;44;4;40;922;586;13724;3110;790;2277;43;t04
lubelskie;323;1.54;24;4;20;545;392;8274;2125;365;1727;33;t06
lubuskie;217;2.15;23;8;15;355;169;5384;1198;234;948;16;t08
łódzkie;410;1.67;44;4;40;669;452;12036;3502;474;2950;78;t10
małopolskie;315;0.92;13;8;5;530;258;9588;2696;343;2311;42;t12
mazowieckie;874;1.61;63;7;56;1760;851;36493;6362;986;5231;145;t14
opolskie;143;1.46;7;0;7;181;145;4463;1219;174;1022;23;t16
podkarpackie;216;1.02;15;0;15;281;136;5725;1815;238;1569;8;t18
podlaskie;226;1.92;3;0;3;346;155;5427;981;242;729;10;t20
pomorskie;740;3.15;31;5;26;934;626;16756;3346;785;2493;68;t22
śląskie;489;1.08;27;2;25;973;508;14718;4528;577;3899;52;t24
świętokrzyskie;147;1.20;26;3;23;175;121;2869;1245;174;1037;34;t26
warmińsko-mazurskie;415;2.92;26;8;18;690;568;10471;1905;443;1443;19;t28
wielkopolskie;740;2.11;36;17;19;1126;604;19211;3858;790;3014;54;t30
zachodniopomorskie;469;2.77;26;1;25;780;464;12538;2546;522;1940;84;t32

Shmuel · Posted 01-22-2021 01:43 AM

1) Try to change ENCODING system option to UTF-8

2) Copy the PROC IMPORT generated code from the log and adapt it to your needs

RLC07 · Posted 01-22-2021 10:06 AM

Thanks. How do I change the ENCODING system option to UTF-8? I mainly use SAS 9.4 (English), should I switch to SAS 9.4 (Unicode Support)?

Tom · Posted 01-22-2021 11:04 AM

@RLC07 wrote:

Thanks. How do I change the ENCODING system option to UTF-8? I mainly use SAS 9.4 (English), should I switch to SAS 9.4 (Unicode Support)?

Yes.

Shmuel · Posted 01-22-2021 01:08 PM

ENCODING as a system option can be defined in the configuration file, or you can override its definition when invoking SAS.

In the .cfg file, search for "encoding=". The current value is probably WLATIN1. Update it to UTF-8.

If that does not help, look for the 'exec sas <options>' command and update it.

A few days ago somebody changed EN to U8 in the exec command, and that helped him.
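
Before editing anything, it may help to see which configuration file the session actually used and which encoding it ended up with. The following is a minimal sketch that only reports values; it changes nothing.

```sas
/* Report the configuration file(s) used to start this session. */
proc options option=config;
run;

/* Report the session encoding that resulted from that configuration. */
proc options option=encoding;
run;

/* The same encoding is also available as an automatic macro variable. */
%put NOTE: Session encoding is &sysencoding;
```

The CONFIG value shows the .cfg file to look in for the "encoding=" line mentioned above.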

andreas_lds · Posted 01-22-2021 01:47 AM

(1) Use a data step.
(2) Use the KEEP statement.
(3) Unsure what you mean: if you want to ignore the first row, use firstobs=2 in the INFILE statement.
(4) What should the new variable be created from? A SAS date is the number of days since 1 January 1960; a format takes care of making it human-readable.
(5) The RENAME statement can do this, but why create the variables with "wrong" names in the first place? That does not make sense.
(6) If the files are identical, use a wildcard in the INFILE statement; if not, you will have to write a data step for each type.
(7) Use another data step with MERGE and BY; all datasets have to be sorted by the variables used in the BY statement.
(8) PROC EXPORT can do this. Note that you have to use dbms=dlm and specify the delimiter; with dbms=csv the comma is used.
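
The steps above might be combined into a single DATA step along the following lines. This is only a sketch: the folder, the file-name pattern, the WLATIN2 encoding, the new names region, cases and report_date, and the idea of taking the date from the first eight characters of the file name (as in 20210120054532_rap_rcb_woj_eksport.csv) are all assumptions. Because a wildcard INFILE stacks every file in one pass, this version needs no separate merge step.

```sas
/* Sketch only: folder, file pattern, encoding and new names are assumptions. */
data work.cases (keep=region cases report_date);
  length fname $260 region $40;
  infile "C:\SAS\Poland_raw\*_rap_rcb_woj_eksport.csv"
         encoding='wlatin2' dsd dlm=';' truncover filename=fname;
  input @;                                  /* load a record so FNAME is current */
  if fname ne lag(fname) then rownum = 0;   /* a new file has started            */
  rownum + 1;                               /* sum statement: retained counter   */
  if rownum <= 2 then delete;               /* skip header and first data row    */
  input region cases;                       /* first two columns, already renamed */
  /* Assumes the file name starts with yyyymmdd. */
  report_date = input(substr(scan(fname, -1, '\/'), 1, 8), yymmdd8.);
  format report_date mmddyy10.;
run;
```

Reading the columns directly under their final names avoids a later rename, and the counter is reset per file because FIRSTOBS= would only skip lines in the first file of a wildcard list.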

RLC07 · Posted 01-22-2021 10:13 AM

Thanks. The files aren't mine, so they were created with variable names that I do not want, along with a first row of data that is of no use to us. I used 'sort by', 'proc import' and 'proc export' to read in the CSV files and to export the merged, cleaned file as a CSV file, but the Polish characters were messed up, and the file couldn't easily be merged with the larger file for the R programmer unless he manually corrected the Polish names beforehand. I can easily manipulate the files if the names are in English. I want to find a way to read in all these CSV files with Polish names, clean and manipulate them, merge them, and then export a CSV with the Polish names retained and the encoding intact.

Tom · Posted 01-22-2021 11:03 AM

I wouldn't use PROC IMPORT if you can avoid it. It has no information about how to define the variables (other than the column headers, which help it generate names for the variables). It just has to guess based on what it sees in the file (which might be only a subset of the universe of possible values for those variables).

You could use the original column headers as labels and then when you generate the new CSV file write the labels as the column headers instead of the variable names.
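
A minimal sketch of that idea follows. The data set, the variable names region and cases, the two sample rows, and the output path are assumptions made for illustration; the UTF-8 output encoding is likewise just one plausible choice for a file that will be read in R.

```sas
/* Small stand-in data set so the sketch runs on its own (values from the sample file). */
data work.cases;
  length region $40;
  input region cases;
  /* Original column headers kept as labels. */
  label region = 'wojewodztwo'
        cases  = 'liczba_przypadkow';
  datalines;
lubelskie 323
lubuskie 217
;
run;

/* Assumed output location; ENCODING= controls how the CSV is written. */
filename outcsv "C:\SAS\Poland\cases_all.csv" encoding='utf-8';

/* LABEL writes the variable labels, not the names, as the header row. */
proc export data=work.cases outfile=outcsv dbms=csv label replace;
run;
```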

RLC07 · Posted 01-22-2021 06:12 PM

Thank you! I used proc import initially but have switched over to using 'infile'. Unfortunately, I am unable to read in the file.

libname Poland 'C:\SAS\Poland';

filename extfile 'C:\Poland_raw\casefile.csv' encoding="utf-8";

data poland.casedata;
infile extfile;
input wojewodztwo $ liczba_przypadkow;
run;

I adapted the above code from the SAS user guide and received this error:

ERROR: Invalid string.
FATAL: Unrecoverable I/O error detected in the execution of the DATA step program.
Aborted during the EXECUTION phase.
NOTE: 3 records were read from the infile EXTFILE.

I am now using SAS 9.4 (Unicode Support) rather than SAS 9.4 (English). I would love to be able to simply read in all of the observations for the first two variables of the raw dataset without messing up the Polish characters, before I manipulate the data in SAS and export the file back out to a CSV with a comma as the delimiter.

SASKiwi · Posted 01-22-2021 08:28 PM

Use this to confirm your current session encoding:

proc options
option = encoding;
run;

Are you running SAS locally on your PC or on a remote SAS server? If you are using a remote SAS server then only a SAS administrator will be able to change session encoding for you.

RLC07 · Posted 01-22-2021 08:51 PM

I am running SAS locally, and using the code you shared I was able to confirm that the encoding is UTF-8. Thanks!

SASKiwi · Posted 01-22-2021 09:06 PM

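@RLC07 - To correctly read your CSV I think you will need to specify the delimiter. Note that DLM= and DSD are INFILE statement options, so they go on the INFILE statement rather than on FILENAME, and your sample file is delimited by semicolons:

filename extfile 'C:\Poland_raw\casefile.csv' encoding="utf-8";
infile extfile dlm=';' dsd;   /* inside the DATA step */

You may need to modify your INPUT statement too.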

RLC07 · Posted 01-22-2021 10:47 PM

Thanks everyone for your suggestions!

Using SAS 9.4 (Unicode Support), I was able to read in one of the raw CSV files in Polish. However, the Polish characters are still not showing up correctly. Any ideas?

Below is my code:

libname Poland 'C:\SAS\Poland';

filename extfile 'C:\SAS\Poland_raw\casefile.csv' encoding="utf-8";
data a;
length loc_name $19.;
infile 'C:\SAS\Poland_raw\casefile.csv' firstobs=2 DLM=';' dsd;
input loc_name $ infected;
run;

This is what I see in my dataset:

Should I use proc cimport? I also read in one of the many SAS user guides about switching locale, i.e. sas9 -locale pl_PL

Is using this appropriate? And if so, where would I put this syntax within my data steps? Thanks all!

Tom · Posted 01-22-2021 11:44 PM

Did you try reading the file using WLATIN2 encoding?

%let path=C:\Downloads;
%let fname=20210120054532_rap_rcb_woj_eksport.csv;

data test1;
  infile "&path\&fname" encoding='wlatin2' dsd firstobs=2 dlm=';';
  input var1 :$30.;
run;

proc freq; tables var1; run;

RLC07 · Posted 01-23-2021 03:15 PM

@Tom Switching from UTF-8 encoding to WLATIN2 encoding worked. Thank you so much for your insights and help! The Polish names in my export file matched those in another file that was read in and cleaned by a colleague who used R. I will know next week whether he can successfully take the larger file that I will create, read it, and merge it with his file of historic case data. When I encounter another raw file with non-English variables in the future, would you recommend that I first try UTF-8 encoding and switch to a different encoding only if it doesn't work, or simply read in the data file using the encoding that best supports that particular language? Thank you again!
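
One way to approach that question, rather than trying encodings in turn, might be to look at the raw bytes of a line that contains a non-ASCII character. The sketch below is only an illustration: the path is an assumption, and it relies on the assumption that SAS does not transcode the bytes when no ENCODING= option is given. For this file, the second line starts with "Cały kraj"; the letter ł is the single byte B3 in Windows-1250 (WLATIN2) and the two bytes C5 82 in UTF-8.

```sas
/* Sketch only: the path is an assumption. No ENCODING= option is given,
   on the assumption that the bytes are then read without transcoding. */
data _null_;
  infile "C:\SAS\Poland_raw\casefile.csv" firstobs=2 obs=3 truncover;
  input line $char30.;
  put line $hex60.;   /* hex dump of the first 30 bytes of lines 2 and 3 */
run;
```

If the dump shows 43 61 B3 79 for "Cały", a single-byte Central European encoding such as WLATIN2 is the likely match; 43 61 C5 82 79 would point to UTF-8.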
