Parsing a Tabular Data File

Parsing Basics

In computing, parsing is the process of analyzing a text, breaking it into its logical components, and translating it into a more useful form. In architectural design, parsing is most often understood as reading a given file and producing the data structure needed to accomplish a larger goal. Many of these files adopt a structure similar to the CSV (Comma Separated Values) format, which is the most common import and export format for spreadsheets and databases. Files such as these are structured like a table, with rows representing individual entries and columns representing a specific value for each entry. Preceding the main body of the data there may be a "preamble" that defines global values and metadata, and a "header" that defines the data elements associated with each column.
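For instance, a small file of this kind might look something like the following (the labels and values here are purely hypothetical, for illustration only):

Location, Anytown
Units, W/m^2
month, day, hour, value
1, 1, 1, 0
1, 1, 2, 0
1, 1, 3, 0

The first two lines form the preamble, the third line is the header, and each remaining line is one row of the table.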

The basic outline for parsing such files looks like this:

function parse file:
    input: path-to-file-to-parse
    output: a useful data-structure
    load file into memory
    for each line in the file:
         parse line
         create entry in data-structure
    repeat
    return data structure
end function
function parse line:
    input: a comma-separated string
    output: an ordered array of values
    split the string according to a delimiter
    store each value in an array
    return the array
end function
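
Translated literally into Python, this outline might look roughly like the sketch below. The function names and the comma delimiter are placeholders rather than part of the example that follows, which uses its own delimiter and column selection.

def parse_file(path):
    # load the file and parse it line by line into a list of entries
    data = []
    file = open(path)
    for line in file:
        data.append(parse_line(line))
    file.close()
    return data

def parse_line(string):
    # split the comma-separated string into an ordered list of values
    return [value.strip() for value in string.split(",")]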


Example 1: Extracting a Single Column from a Tabular Data File

In this first example, we will be working with the example file dirNormals.txt, which has 4 columns of data corresponding to the month, day, hour and direct normal irradiation measured at a given location throughout a year. The goal is to extract the column corresponding to the direct normal irradiation and store it in a list called dirNormalData.
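
The full contents of dirNormals.txt are not reproduced here; based on the description above and the two header lines that are skipped in the code below, its first few lines might look something like this (the values shown are hypothetical, whitespace-separated):

dirNormals.txt: hourly direct normal irradiation (W/m^2)
month day hour dirNormal
1 1 1 0
1 1 2 0
1 1 3 0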

Following the outline given above, we can write two functions to extract this data: one that parses the entire file, and another that parses individual lines of the file.

def parseEx_file(filename, skip, col):
    file = open(filename)
    lineno = 0 # keeps track of how many lines have been parsed
    dataOut = []
    for line in file:
        if lineno > skip-1: # only parse this line if we are past the header
            dataCol = parseEx_line(line, col)
            if dataCol is not None: # skip any line that could not be parsed
                dataOut.append(dataCol)
        lineno += 1 # keep track of how many lines have been parsed
    file.close()
    return dataOut

def parseEx_line(string, col):
    # split this string using whitespace as the delimiter
    data = [int(n) for n in string.split()]
    return data[col-1] # return only the value in the requested column (columns counted from 1)
 
 
#Usage:
fileName = "dirNormals.txt"
number_of_header_lines = 2
columnNumber = 4 #the column that we want to extract
dirNormalData = parseEx_file(fileName, number_of_header_lines, columnNumber)


Whenever you are parsing a data file, you should keep in mind that the data file may be incomplete (a more defensive version of parseEx_line is sketched at the end of this example). In this case, I've made sure that the file dirNormals.txt is a complete file (i.e. it has all 8760 hours of the year represented). It is therefore quite easy to extract the direct normal irradiation for any (whole) hour of the year.

import solarGeom as sg
 
date = "3/23"
h = "13:00"
day = sg.calc_dayOfYear(date) # day of the year, counted from 1
hour = sg.calc_hourDecimal(h) # hour of the day as a decimal number
hourYear = (day-1)*24 + int(hour)-1 # index of this hour within the 8760-entry list
print("Direct Normal Irradiation on "+date+" at "+h+" is "+str(dirNormalData[hourYear])+" W/m^2")