BookmarkSubscribeRSS Feed
Elliott
Obsidian | Level 7

Hello,

 

I have been tasked with taking large sas datasets of data and converting to .csv files to be used to load to hadoop.

 

The data is daily and monthly and very large.  I have tried the proc export but that did not produce a file with all the records.

I have tried a data _null_ and that created the file but the field names were missing and there was data that did not have the correct formatting.

 

I have to convert 13 months of data this way, how can this be done so that the .csv contains all the records and has correct formatting?

 

Thanks,

Elliott

14 REPLIES 14
Kurt_Bremser
Super User

If your csv did not contain all the observations you expected, the log should have something to say about it.

Unless you had global obs= option set that limited your processing.

r_behata
Barite | Level 11

Could you please show your code.  particularly the filename statement.

 

Does it look like this :

 

filename out hadoop '/user/testing/' user='xxxx'  pass='xxxx' recfm=v  lrecl=32167 dir ;

Tom
Super User Tom
Super User

CSV files are easy to make.  Especially if you don't need to include a header row.

data _null_;
  set have ;
  file 'myfile.csv' dsd ;
  put (_all_) (+0);
run;

If you want to add a header row you could just hard code it.  So for example you could explicitly list the variables to write and add an extra line at the top with the variable names.

data _null_;
  set have ;
  file 'myfile.csv' dsd ;
  if _n_=1 then put 'var1,var2,var3';
  put var1 var2 var3 ;
run;

Or you could write the file in two steps and generate the header from the metadata.

proc transpose data=have(obs=0) out=names ; run;
data _null_;
  file 'myfile.csv' dsd ;
  set names end=eof;
  put _name_ @;
  if eof then put;
run;
data _null_;
  set have ;
  file 'myfile.csv' dsd mod ;
  put (_all_) (+0);
run;

 If you are having trouble with the "format" then you need to explain what trouble that is. Using the DSD option SAS will create valid CSV files.  That is that each observation is one line of the file. Fields are separated by commas.  Any value that contains a comma or a quote character is enclosed in quotes. Quotes in the data are represented by two quotes. 

Example:

id,Description
1,First
2,"two, three and four"
3,"Has ""quotes"" embedded"
"1,123",Number has comma in it
Elliott
Obsidian | Level 7
This is what I was looking for and have tested... but
What about the formatting of the data, there are variables that have long strings of numbers that run into scientific notation and I need all numbers to appear. Can they be converted to a text format in the Data _null_ statment?


Thanks,
Tom
Super User Tom
Super User

@Elliott wrote:
This is what I was looking for and have tested... but
What about the formatting of the data, there are variables that have long strings of numbers that run into scientific notation and I need all numbers to appear. Can they be converted to a text format in the Data _null_ statment?


Thanks,

The PUT statement will honor the FORMAT that you have attached to the variable. You could try BEST32. format for numbers.  But if they are really "long strings of numbers" then they probably should be character variables in the original SAS dataset.  SAS can only store about 15 decimal places accurately.  Strings of digits with more precision than that will be rounded.

Elliott
Obsidian | Level 7

Hi Tom,

 

The data _null_ did not work when I tried to pull in the variable names, it did not pick up all of them.

 

I went back the the proc export and when I read the log it does pull all the records, I had tried to download and that is where I thought it did not pull all the records.  

 

The proc export log does show the correct formats but when I look at the excel with the data it is not displaying correctly.  Not sure what to do with that.

 

Can a put stmt be used in proc export?

 

Thanks,

Tom
Super User Tom
Super User

Do not let Excel automatically open a CSV file.  It will autoconvert values for you. So it will do stupid things like convert a value like '10-12' into a DATE.

 

If you must open the file in Excel then use the manual data import tools.  That will give you the opportunity to tell Excel how to treat each column.  Which ones are numbers and which are strings. Which are Date or Time values, etc.

 

Or just use PROC EXPORT to write an XLSX file to begin with.

Elliott
Obsidian | Level 7
the requested specifically wants a .csv file, they will be reading it into hadoop
Tom
Super User Tom
Super User

@Elliott wrote:
the requested specifically wants a .csv file, they will be reading it into hadoop

Ok. So ignore Excel.  What does the actual CSV file look like?  It is a text file so you can look at it with any text editor.

 

If there are specific values that are causing trouble for Hadoop then please create a small sample dataset that reproduces the issue and post that here (in the form of a data step that others can run) and example of the file that PROC EXPORT produces for that sample data and a description of what about it does not work for Hadoop.  Make sure to use the Insert Code or Insert SAS code buttons in the forum editor so your sample program and CSV file can be posted without being interpreted by the forum and paragraphs of text to be reformatted.

ballardw
Super User

@Elliott wrote:

Hello,

 

I have been tasked with taking large sas datasets of data and converting to .csv files to be used to load to hadoop.

 

The data is daily and monthly and very large.  I have tried the proc export but that did not produce a file with all the records.

I have tried a data _null_ and that created the file but the field names were missing and there was data that did not have the correct formatting.

 

I have to convert 13 months of data this way, how can this be done so that the .csv contains all the records and has correct formatting?

 

Thanks,

Elliott


This may sound like a silly question but exactly how did you determine the number of exported records?

If the answer involves a spreadsheet an your data is as large as you imply the file may have exceeded the number of rows a spreadsheet will work with and so would not show the "correct number of observations".

 

I used to get to explain this to graduate students about every two weeks when Excel had a 65K line limit and the data sets our shop used frequently exceeded that limit.

Elliott
Obsidian | Level 7

Sorry just getting back after the holidays... the users have come back after testing with the data I provided and now they want double quotes around each field so if it is a lastname, firstname issue then it can be accounted for.  does that make sense?

 

my current code is:

proc export data=myfile

                   outfile="mypath/myfile.csv"

                   dbms=csv

                   replace;

                   putnames=yes;

run;

 

Thanks,

ballardw
Super User

@Elliott wrote:

Sorry just getting back after the holidays... the users have come back after testing with the data I provided and now they want double quotes around each field so if it is a lastname, firstname issue then it can be accounted for.  does that make sense?

 

my current code is:

proc export data=myfile

                   outfile="mypath/myfile.csv"

                   dbms=csv

                   replace;

                   putnames=yes;

run;

 

Thanks,


Is the requirement to place double quotes around all character variables or only specific ones?

Proc export, as you have likely noticed, will place quotes around fields that may have an issue with comma locations only such as an imbedded comma in a value. With a more specific requirement, such as double quotes even when a comma is not in the value then you will have to go to data step coding. Which will require you to provide 1) the  header row for the first column and 2) to create the output line. Since you say you previously had issues with these I provide a brief example

data _null_;
   /* use the OBS= option to only output 5 rows of data for 
      quick review of results*/
   set sashelp.class (obs=5);
   /* send exampe to the results window for easy viewing*/
   file print;
   /* header only output once*/
   if _n_=1 then put '"Name","Sex","Age","Height","Weight"';
   /* long enough to hold longest expected record*/
   length str $100.; 
   str = catx(',',quote(strip(name)),quote(strip(sex)),age,height,weight);
   /* or to apply specific formats*/
   str = catx(',',quote(strip(name)),quote(strip(sex)),age,put(height,4.1),put(weight,6.2));

   put str;
run;

If you do not have a explicit format assigned the proc export will use a "best" format to convert numeric to text for the export. You did not explicitly state what the format issues were but I suspect one of the issues may have been inconsistent decimal appearance such as the BEST format creates. Also if your values have enough digits then the default, I believe BEST12. may truncate some values.

 

The strip is used in the above code to prevent values appearing as " John  " or similar in the output.  Likely the quote is needed for the SEX field but the quote function used to place the double quotes will by default pad the length of the text to the defined length yielding potentially many blanks in the data.

 

I used the 6.2 to demonstrate that you can force appearance of 0 values past the actual values of the data if desired.

I do realize this potentially is a fair amount of code depending on your number of variables but that is result when custom output is needed. If only a few variables and not all character variables need the quotes then the code is a bit simpler.

You may need to ask your crew whether the column headers need the quotes or not. If you want a comma as part of the header (not recommended) you will need the quotes for any columns that do so.

SASKiwi
PROC Star

@Elliott - I'm curious to know why you are using CSVs to load Hadoop when SAS/ACCESS to Hadoop provides the ability to load SAS data directly and without any of the translation issues you are striking? Of course if you don't have SAS/ACCESS to Hadoop that would explain it.

Tom
Super User Tom
Super User

PROC EXPORT using DBMS=CSV will automatically add quotes around fields where they are required. The same is true for the data step solutions using the DSD option on the FILE statement.

 

Do you want to force quotes around character variables even for values that do not require quotes?

sas-innovate-2024.png

Available on demand!

Missed SAS Innovate Las Vegas? Watch all the action for free! View the keynotes, general sessions and 22 breakouts on demand.

 

Register now!

How to Concatenate Values

Learn how use the CAT functions in SAS to join values from multiple variables into a single value.

Find more tutorials on the SAS Users YouTube channel.

Click image to register for webinarClick image to register for webinar

Classroom Training Available!

Select SAS Training centers are offering in-person courses. View upcoming courses for:

View all other training opportunities.

Discussion stats
  • 14 replies
  • 5622 views
  • 6 likes
  • 6 in conversation