DATA Step, Macro, Functions and more

How do I split my data set into multiple files? second

Reply
Contributor
Posts: 30

How do I split my data set into multiple files? second

Sorry about this but my replies were for some reason taken for review and needs to wait for approvals from the board for my previous question thread (https://communities.sas.com/t5/Base-SAS-Programming/How-do-I-split-my-data-set-into-multiple-files/m...)

The question is the same...  I need to split file (Text file) into smaller files in order to process them. Many thanks to RW9 and Tom. I am now able to split the files as Tom suggested. 

create table tmp as
 select * from connection to sqlserver
  (select
      substring(cast(bigfile_xml as nvarchar(max)),1,17000) as XMLCol1
    , substring(cast(bigfile_xml as nvarchar(max)),17001,17000) as XMLCol2
    , substring(cast(bigfile_xml as nvarchar(max)),34001,17000) as XMLCol3
...
   from
      bigfile
  where id = 1;
   )
;

 However, I was hoping for a way to also allow the code to derive an appropriate number of variables needed as the number of characters in each observation could be different. 

I tried writing a code into it but it didn't work. I'd appreciate any pointers on where I went wrong or what I should do.

Thank you

Code:

 

%let BigfileC= CountC(bigfile_xml);              *CountC can be used to count characters instead of words.;

%let chunks = round(BigfileC/32000);   *i'm trying to round it up so I know how many chunks I need for it ;

 

I = chunks;                                         * trying to set a value to I for the DO LOOP;

Array Text(I);                                       * didnt think I needed Array but errors suggest I do;

Do I = 0 to chunks;                              * I'm thinking a DO LOOP would be a possible method for this splitting?;

 

Then finally I can just apply these into the code Tom suggested

Substring(cast(bigfile_xml as nvarchar(max)),I*17000+1,(I+1)*17000) as XMLCol(I) ;   * This is how I thought would help create the files.;

 

This doesn't work. Please help. and sorry for the second thread but Services told me I might have to wait a long time before they approve my reply in my previous thread...

 

Contributor
Posts: 30

Re: How do I split my data set into multiple files? second

Posted in reply to AlexMoreton

I would also need to combine them together again after cleaning the data so if anyone have any idea on the best way to combine/merge them again (i've been looking into UNION and MERGE right now). please let me know.

As of now the string is not broken into different columns/ variables which UNION does not seem to be working...

Super User
Super User
Posts: 9,813

Re: How do I split my data set into multiple files? second

Posted in reply to AlexMoreton

Not sure I follow you.  First you are not able to combine the variables back into one in a SAS dataset, SAS Datasets have a limit of 2000 characters per variable.

Also, its rarely a good idea to put data items across the page (called transposed) as this just makes working with the data far more difficult.  Create an observation for each break so you have a list of data in one column, you will find it much easier to write code this way.

Why do you need to combine this data again?  Is it just to write to a file, if so have your observations and just put them out to a file.  If you want to do analysis, then consider breaking the data up into a usable format.

As we have yet to see anything in the way of inputs or what you want out, its impossible to tell you have to do the middle bit.

Contributor
Posts: 30

Re: How do I split my data set into multiple files? second

Hi RW9,

The file is a text file (as you suggested) and the string itself is roughly 50k - 100k characters long.

I am calling the data from a SQL server (as a result I was not sure what file type it was/is but i think its text file as it matched your description). For the purpose of my project I am to break the string (retrieved from sql server using select from) down into smaller chunks and run cleansing for all of them and then export them as a single XML file with all of them combined. The file that is retrieved is literally one single observations of a string of characters.

 

The idea of breaking the string via observation so I only end up a list of data with only one column was what I wanted. But as I was unable to create a code to do anything... Breaking it into variables became the next best thing... I do not know how to code it to break via observations based on character numbers (as I still want to be able to know how many characters each file has (different text files will have different lengths).

 Please tell me what more information you need and I will try and supply them to you.

Thank you

Super User
Super User
Posts: 9,813

Re: How do I split my data set into multiple files? second

Posted in reply to AlexMoreton

I provided code:

https://communities.sas.com/t5/Base-SAS-Programming/How-do-I-split-my-data-set-into-multiple-files/m...

 

However, as you seem to be using SQL and mentioning SQL server, it sounds like your taking data from a database.  This is not the same as for a text file.

 

We seem to be getting the wrong end of the stick here. There are three processes involved here as I see it:

1) Read the data from the database into a dataset - @Tom has provided code to do this, and you get a dataset with lots of variables yes?  I still don't see why you get one long text string, does it not have delimiters or breaks or something?

2) Process this data to "cleanse it" - no information here so can't help

3) Write this final dataset out to an XML file.  This should be quite simple:

data _null_;
  file "c:\want.xml";
  set have;
  put "<xml version=1.2>";
  put "<dataset>";
  put _all_;
  put "</dataset>"; 
run;
  

This is just an example of writing a text file with some tags to be an XML file.

Contributor
Posts: 30

Re: How do I split my data set into multiple files? second

Hi RW9,

I have tried your code but as it seems that I cannot save the file into a library first as the it does not contain all the characters. 

When I worked Tom's code, I was able to find the complete code separated into 5 columns (not rows...). 

So in this sense I was unable to follow your code due to the inability to use the FILE ' ' step. 

I tried the code with a dummy file that consisted 90K characters but an error comes out.

ERROR: Either a CARDS or an INFILE statement is required

this was the code i used:

 

 

data want;
file 'Y:\My Docs\Alex\TEST.txt';
length line $2000;
input a $char1.;
line=cat(line,a);
if lengthn(line) > 1999 then do;
output;
line="";
end;
run;

 

In terms of cleansing, basically the string contained "bad characters that could not be read" so I need to basically find and replace them.

Super User
Super User
Posts: 8,264

Re: How do I split my data set into multiple files? second

Posted in reply to AlexMoreton

Don't bother. Just check the database definition or SQLServer documentation for what is the longest string that your CLOB can contain and write the code to handle that.  SAS (or you) will not care if the last 10 variables are all blanks.

 

If SQL server complains about using a SUBSTR starting location or length that is longer than the input string then use CASE statement.  You will need to find the SQLserver function that returns the length of the CLOB string.

 

Ask a Question
Discussion stats
  • 6 replies
  • 164 views
  • 1 like
  • 3 in conversation