Parsing a Tabular Data File
Parsing Basics
In computing, parsing is the process of analyzing a text, breaking it into its logical components, and translating it into a more useful form. In architectural design, parsing is most often understood as reading a given file and producing a data structure required to accomplish a larger goal. Many of these files adopt a structure similar to the CSV (Comma Separated Values) format, which is the most common import and export format for spreadsheets and databases. Files such as these are structured like a table, with rows representing individual entries and columns representing a specific value for each entry. Preceding the main body of the data there may be a "preamble" that defines global values and metadata, and a "header" that defines the data elements associated with each column.
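As a minimal illustration of the idea, the short sketch below parses a single comma-separated line in Python by splitting it on the delimiter and converting each piece into a more useful type. The line and the meanings of its fields are invented purely for demonstration.

# Hypothetical comma-separated line; the values and their meanings are invented.
line = "2,14,9,451"                  # e.g. month, day, hour, measurement
values = line.strip().split(",")     # break the text into its logical components
record = [int(v) for v in values]    # translate each piece into a more useful form
print(record)                        # [2, 14, 9, 451]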
The basic outline for parsing such files looks like this:
function parse file:
    input:  path-to-file-to-parse
    output: a useful data-structure

    load file into memory
    for each line in the file:
        parse line
        create entry in data-structure
    repeat
    return data-structure
end function
function parse line:
    input:  a comma-separated string
    output: an ordered array of values

    split the string according to a delimiter
    store each value in an array
    return the array
end function
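Translated directly into Python, the outline might look something like the sketch below. The file name, the comma delimiter, and the decision to keep every value as a string are assumptions made for illustration; the example that follows adapts the same pattern to a concrete file.

def parse_file(path):
    # Parse a comma-separated file into a list of rows (the "useful data structure").
    data = []
    with open(path) as f:              # load the file
        for line in f:                 # for each line in the file
            row = parse_line(line)     # parse line
            data.append(row)           # create an entry in the data structure
    return data

def parse_line(string):
    # Split the string on the comma delimiter and return an ordered list of values.
    return [value.strip() for value in string.split(",")]

# Usage (hypothetical file):
# rows = parse_file("measurements.csv")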
Example 1: Extracting a Single Column from a Tabular Data File
In this first example, we will be working with an example file, dirNormals.txt, which has 4 columns of data corresponding to the month, day, hour, and direct normal irradiation measured at a given location throughout a year (its layout is sketched below). The goal is to extract the column corresponding to the direct normals and store it in a list called dirNormalData.
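The file itself is not reproduced here, but a file with this layout begins with a short header followed by one whitespace-separated row per hour of the year; the placeholders below only indicate the structure, not actual contents.

<preamble line: global values / metadata>
<header line: month  day  hour  dirNormal>
<month>  <day>  <hour>  <direct normal irradiation>
<month>  <day>  <hour>  <direct normal irradiation>
...                     (8760 data rows, one per hour of the year)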
Following the outline given above, we can write two functions to extract this data: one that parses the entire file, and another that parses individual lines of the file.
def parseEx_file(filename, skip, col):
    file = open(filename)
    lineno = 0               # keeps track of how many lines have been parsed
    dataOut = []
    for line in file:
        if lineno > skip-1:  # only parse this line if past the header
            dataCol = parseEx_line(line, col)
            if dataCol is not None:
                dataOut.append(dataCol)
        lineno += 1          # keep track of how many lines have been parsed
    file.close()
    return dataOut

def parseEx_line(string, col):
    # split this string using whitespace as the delimiter
    values = string.split()
    if len(values) < col:    # blank or incomplete line: signal the caller to skip it
        return None
    data = [int(n) for n in values]
    return data[col-1]       # return only the value in the requested column

# Usage:
fileName = "dirNormals.txt"
number_of_header_lines = 2
columnNumber = 4             # the column that we want to extract
dirNormalData = parseEx_file(fileName, number_of_header_lines, columnNumber)
Whenever you are parsing a data file, you should keep in mind that the file may be incomplete. In this case, I've made sure that the file dirNormals.txt is a complete file (i.e., all 8760 hours in a year are represented). It is therefore quite easy to extract the direct normal irradiation for any (whole) hour of the year.
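Before indexing into the list by hour of the year, a quick sanity check along these lines (a suggestion, not part of the original example) can confirm that assumption:

# Confirm that one value was parsed for every hour of the year before indexing by hour.
if len(dirNormalData) != 8760:
    print("Warning: expected 8760 hourly values, found " + str(len(dirNormalData)))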
import solarGeom as sg

date = "3/23"
h = "13:00"
day = sg.calc_dayOfYear(date)        # convert the date string into a day of the year
hour = sg.calc_hourDecimal(h)        # convert the time string into a decimal hour
hourYear = (day-1)*24 + int(hour)-1  # index of this hour within the annual list
print("Direct Normal Irradiation on "+date+" at "+h+" is "+str(dirNormalData[hourYear])+" w/m^2")
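If this lookup will be repeated for many dates, it may be convenient to wrap the index arithmetic in a small helper. The function name below is hypothetical; it simply packages the calculation shown above and assumes a complete list of 8760 hourly values.

def lookup_dirNormal(date, h, hourlyData):
    # Return the value for the given date ("m/d") and whole hour ("HH:MM")
    # from a complete list of 8760 hourly values.
    day = sg.calc_dayOfYear(date)
    hour = sg.calc_hourDecimal(h)
    return hourlyData[(day-1)*24 + int(hour)-1]

# Usage:
# lookup_dirNormal("6/21", "12:00", dirNormalData)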