BookmarkSubscribeRSS Feed
🔒 This topic is solved and locked. Need further help from the community? Please sign in and ask a new question.
ldierker1
Calcite | Level 5
My code:     
data new; length country $50 ; format country $50.; informat country $50.; set       work.lifeexpectancy;
keep country _2009 _2010 _2011 _2012 _2013 _2014 _2015 _2016 _2017 _2018;
proc sort; by country;
proc contents; run;
proc freq; tables country; run;
 
My output:
The contents shows that country is a character variable with $50 len, $50 format and $50 informat;
 
In the table, several strings are getting cut off well before 50 length. 
e.g. "Sao Tome and Princi" or "St. Vincent and the"
This is causing troubles later when I use the string to make secondary variables. 
 
Thank you for your help. 
 
 
 
 

 

1 ACCEPTED SOLUTION

Accepted Solutions
Tom
Super User Tom
Super User

If you use PROC IMPORT to read a text file (CSV files are comma delimited text files) it has to GUESS how to define the variables.   Plus to save time it will by default only look at the beginning of the file to get information to make its guesses.  By saving the file as an XLSX file you ended up using a different method of conversion and show it did not guess the wrong length for that variable.  But by opening a CSV file with EXCEL you have now risked having excel convert the values of the data.  Excel will be default convert strings that it thinks look like number or dates. So strings like ZIPCODEs that might start with a leading zero will lose the leading zero.  And stings that consist of two sets of digits separated by a hyphen might get converted in dates.

 

Write your own data step to read the CSV file and you will have full control over how the variables are defined and read.

View solution in original post

5 REPLIES 5
ballardw
Super User

@ldierker1 wrote:
My code:     
data new; length country $50 ; format country $50.; informat country $50.; set       work.lifeexpectancy;
keep country _2009 _2010 _2011 _2012 _2013 _2014 _2015 _2016 _2017 _2018;
proc sort; by country;
proc contents; run;
proc freq; tables country; run;
 
My output:
The contents shows that country is a character variable with $50 len, $50 format and $50 informat;
 
In the table, several strings are getting cut off well before 50 length. 
e.g. "Sao Tome and Princi" or "St. Vincent and the"
This is causing troubles later when I use the string to make secondary variables. 
 

So when you run Proc Contents on work.lifeexpectancy what do you see for the variables?

 

If the variable from the Lifeexpentancy , or any other data set on a SET statement has  length of 25 that is the longest value you are going to get even if you redefine the property with a length statement as you have done here.

Consider:

data example;
   somevar = 'Not very long text';
run;

proc contents data=example;
run;

data modified;
   length somevar $ 50;
   set example;
run;

proc contents data=modified;
run;

The first data set only has 18 characters in the value. So even though I change the property of the variable, and will show in proc contents as longer, there is no information added just because I made the variable longer.

ldierker1
Calcite | Level 5

Thank you. I did not realize that use of a length statement allow you to redefine length in the case where the longest value is shorter. Sounds like I was trying to put a square in round hole. I think that I have resolved this, though I did so in a way that leaves me with questions. The data set that I was using was from gapminder.org. When I downloaded the xlsx file and ran the program, I received an "exception error". I have seem this before and knew it was an unknown formatting bug that would not let me be successful with that version. Next, I downloaded the csv file. This was the file I was having difficulty with when I posted my question. Once you shared the detail about the limitations of the length statement, I again wondered if it was a formatting issue with the original csv file. I opened it and saved the csv as xlsx (rather than downloading the xlsx from the website). This time my length statement before SET achieved what I wanted it to and there was no more truncation. Experience like this have been accumulating for me. That is, I start with a standard file format, but find it does not behave in a standard way. Thanks again for the length tip! Very helpful.

Tom
Super User Tom
Super User

If you use PROC IMPORT to read a text file (CSV files are comma delimited text files) it has to GUESS how to define the variables.   Plus to save time it will by default only look at the beginning of the file to get information to make its guesses.  By saving the file as an XLSX file you ended up using a different method of conversion and show it did not guess the wrong length for that variable.  But by opening a CSV file with EXCEL you have now risked having excel convert the values of the data.  Excel will be default convert strings that it thinks look like number or dates. So strings like ZIPCODEs that might start with a leading zero will lose the leading zero.  And stings that consist of two sets of digits separated by a hyphen might get converted in dates.

 

Write your own data step to read the CSV file and you will have full control over how the variables are defined and read.

mkeintz
PROC Star

You've investigated the length of the variable storage.  But perhaps the variable value was not read in correctly.  So make sure there is a $50 character value in COUNTRY.  Do a proc print of the country variable, or open the dataset interactively to examine visually.

--------------------------
The hash OUTPUT method will overwrite a SAS data set, but not append. That can be costly. Consider voting for Add a HASH object method which would append a hash object to an existing SAS data set

Would enabling PROC SORT to simultaneously output multiple datasets be useful? Then vote for
Allow PROC SORT to output multiple datasets

--------------------------

hackathon24-white-horiz.png

The 2025 SAS Hackathon has begun!

It's finally time to hack! Remember to visit the SAS Hacker's Hub regularly for news and updates.

Latest Updates

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

SAS Training: Just a Click Away

 Ready to level-up your skills? Choose your own adventure.

Browse our catalog!

Discussion stats
  • 5 replies
  • 11247 views
  • 0 likes
  • 4 in conversation