I have a text file with millions of observations. Each observation has three variables: name, subname, and number.
However, subname is optional, and nothing in the file marks when it is missing.
Part of the data:
John (tech) (43) Johnson(econ) (32) Julian (24) Justin (34) Jo (math) (32)
Julia(econ) (33) June (93)
....
This can be done in a number of ways. One is to read your data into three variables, then check whether the second variable contains any digits (using the ANYDIGIT function); if so, move its contents to the third variable.
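Here is a rough sketch of that check, written in Python only to show the logic (the function name shift_if_numeric is made up for illustration, and any() with str.isdigit() stands in for ANYDIGIT). It assumes each record holds exactly one observation, which the sample above may not satisfy, so treat it as a starting point rather than a finished solution.

```python
def shift_if_numeric(name, subname, number):
    """If the second field contains a digit, it is really the number.

    Hypothetical helper: mirrors the idea of testing the second
    variable with ANYDIGIT and moving it to the third variable.
    """
    if any(ch.isdigit() for ch in subname):
        # No subname was present, so the value read as subname is the number.
        return name, "", subname
    return name, subname, number


# A record without a subname: the second field is shifted to the third.
print(shift_if_numeric("Julian", "(24)", ""))
# A complete record is returned unchanged.
print(shift_if_numeric("John", "(tech)", "(43)"))
```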
Although the problem is in general not too difficult to solve, I can think of a few challenges that might arise depending on what your raw data look like.
Is the record structure really the way you show it to us (several 'observations' on one line)? Could it be that the name is sometimes missing (quite possible if there are millions of records), so that you could have 2 to 4 consecutive values in brackets belonging to 2 different 'observations'?
Please let me know, as the concrete solution will depend on what the data look like.
The easiest way I can think of right now is to use Regular Expressions (the PRX.. functions in SAS) to decide which substring makes up an 'observation'. Regular Expressions do take some practice to use and understand, though.
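To make the idea concrete, here is a minimal sketch of such a pattern, written in Python for illustration; in SAS the same expression would presumably go through PRXPARSE and CALL PRXNEXT. It assumes names and subnames consist of letters only and numbers of digits only, which is a guess based on the sample lines. The names OBS, parse_line and leftover are made up for this example.

```python
import re

# One 'observation': a name, an optional (subname), then a (number).
# Assumption: names/subnames are letters only, numbers are digits only.
OBS = re.compile(r"([A-Za-z]+)\s*(?:\(([A-Za-z]+)\))?\s*\((\d+)\)")


def parse_line(line):
    """Return every (name, subname, number) found on one line."""
    return [
        (m.group(1), m.group(2) or "", int(m.group(3)))
        for m in OBS.finditer(line)
    ]


def leftover(line):
    """Return whatever text the pattern did not consume.

    Anything left over (for example bracketed values with no name
    in front of them) points to a record that needs a closer look.
    """
    return OBS.sub("", line).strip()


sample = "John (tech) (43) Johnson(econ) (32) Julian (24) Justin (34) Jo (math) (32)"
for obs in parse_line(sample):
    print(obs)
```

Running leftover() over the raw lines is one way to find out whether names are ever missing: a non-empty result means some text did not fit the assumed structure.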
A big thank you to linux and patrick.
With your recommendations, I found my problem. It lies in the structure of the raw data, because the raw data is too rough.
It is solved now. Thank you for your time!