If you use PROC IMPORT to read a text file (CSV files are comma delimited text files) it has to GUESS how to define the variables. Plus to save time it will by default only look at the beginning of the file to get information to make its guesses. By saving the file as an XLSX file you ended up using a different method of conversion and show it did not guess the wrong length for that variable. But by opening a CSV file with EXCEL you have now risked having excel convert the values of the data. Excel will be default convert strings that it thinks look like number or dates. So strings like ZIPCODEs that might start with a leading zero will lose the leading zero. And stings that consist of two sets of digits separated by a hyphen might get converted in dates.
Write your own data step to read the CSV file and you will have full control over how the variables are defined and read.
@ldierker1 wrote:
My code:data new; length country $50 ; format country $50.; informat country $50.; set work.lifeexpectancy;
keep country _2009 _2010 _2011 _2012 _2013 _2014 _2015 _2016 _2017 _2018;proc sort; by country;proc contents; run;proc freq; tables country; run;My output:The contents shows that country is a character variable with $50 len, $50 format and $50 informat;In the table, several strings are getting cut off well before 50 length.e.g. "Sao Tome and Princi" or "St. Vincent and the"This is causing troubles later when I use the string to make secondary variables.
So when you run Proc Contents on work.lifeexpectancy what do you see for the variables?
If the variable from the Lifeexpentancy , or any other data set on a SET statement has length of 25 that is the longest value you are going to get even if you redefine the property with a length statement as you have done here.
Consider:
data example; somevar = 'Not very long text'; run; proc contents data=example; run; data modified; length somevar $ 50; set example; run; proc contents data=modified; run;
The first data set only has 18 characters in the value. So even though I change the property of the variable, and will show in proc contents as longer, there is no information added just because I made the variable longer.
Thank you. I did not realize that use of a length statement allow you to redefine length in the case where the longest value is shorter. Sounds like I was trying to put a square in round hole. I think that I have resolved this, though I did so in a way that leaves me with questions. The data set that I was using was from gapminder.org. When I downloaded the xlsx file and ran the program, I received an "exception error". I have seem this before and knew it was an unknown formatting bug that would not let me be successful with that version. Next, I downloaded the csv file. This was the file I was having difficulty with when I posted my question. Once you shared the detail about the limitations of the length statement, I again wondered if it was a formatting issue with the original csv file. I opened it and saved the csv as xlsx (rather than downloading the xlsx from the website). This time my length statement before SET achieved what I wanted it to and there was no more truncation. Experience like this have been accumulating for me. That is, I start with a standard file format, but find it does not behave in a standard way. Thanks again for the length tip! Very helpful.
If you use PROC IMPORT to read a text file (CSV files are comma delimited text files) it has to GUESS how to define the variables. Plus to save time it will by default only look at the beginning of the file to get information to make its guesses. By saving the file as an XLSX file you ended up using a different method of conversion and show it did not guess the wrong length for that variable. But by opening a CSV file with EXCEL you have now risked having excel convert the values of the data. Excel will be default convert strings that it thinks look like number or dates. So strings like ZIPCODEs that might start with a leading zero will lose the leading zero. And stings that consist of two sets of digits separated by a hyphen might get converted in dates.
Write your own data step to read the CSV file and you will have full control over how the variables are defined and read.
You've investigated the length of the variable storage. But perhaps the variable value was not read in correctly. So make sure there is a $50 character value in COUNTRY. Do a proc print of the country variable, or open the dataset interactively to examine visually.
Learn how use the CAT functions in SAS to join values from multiple variables into a single value.
Find more tutorials on the SAS Users YouTube channel.
Ready to level-up your skills? Choose your own adventure.