<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>peawee dot net &#187; data</title>
	<atom:link href="http://peawee.net/tag/data/feed/" rel="self" type="application/rss+xml" />
	<link>http://peawee.net</link>
	<description>design music software life</description>
	<lastBuildDate>Sun, 08 Jan 2012 04:15:29 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Damn Dirty Data</title>
		<link>http://peawee.net/2010/01/07/damn-dirty-data/</link>
		<comments>http://peawee.net/2010/01/07/damn-dirty-data/#comments</comments>
		<pubDate>Fri, 08 Jan 2010 00:13:53 +0000</pubDate>
		<dc:creator>Matt</dc:creator>
				<category><![CDATA[Geeky Peawee]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[fortran]]></category>
		<category><![CDATA[frustrations]]></category>
		<category><![CDATA[work]]></category>

		<guid isPermaLink="false">http://peawee.net/?p=134</guid>
		<description><![CDATA[No, this post isn&#8217;t Star Trek related. It&#8217;s about SCIENCE! I realize that the real world makes a lot of things difficult for those who collect data. Consider this, then, a gripe list of things I may or may not expect to get fixed. Moving on, I work on computer models to better understand physical [...]]]></description>
			<content:encoded><![CDATA[<p>No, this post isn&#8217;t Star Trek related.  It&#8217;s about SCIENCE!</p>
<p>I realize that the real world makes a lot of things difficult for those who collect data.  Consider this, then, a gripe list of things I may or may not expect to get fixed.</p>
<p>Moving on, I work on computer models to better understand physical processes.  These models are built and parameterized from real-world data of one kind or another.  This means I consume a lot of data in my job.  However, how I as a modeler would like data is not often how I get the data.  Here are some of the bigger issues:</p>
<ul>
<li><strong>Oddball formats</strong>.  On a given day, I have to deal with files in <a href="http://en.wikipedia.org/wiki/Microsoft_excel">Excel</a> format, varieties of plain text, oddball Fortran binary files (undocumented, of course), <a href="http://en.wikipedia.org/wiki/NetCDF">NetCDF</a>, <a href="http://en.wikipedia.org/wiki/HDF5">HDF</a>, ESRI <a href="http://en.wikipedia.org/wiki/Shapefile">Shapefiles</a>, <a href="http://en.wikipedia.org/wiki/GeoTIFF">GeoTIFF</a>, and a whole bunch of other raster formats.  And this doesn&#8217;t particularly faze me, either.  What fazes me?  <em>Incomplete</em> oddball formats.  Whenever a new and interesting bit of geodata ends up on my desk in an oddball format, I can wind up spending over a day poking at it.  When I get gridded data, I need to know a few details about each datapoint in the grid:
<ol>
<li>The lat/lon coordinates</li>
<li>Units being used</li>
<li>The digital representation of the data</li>
</ol>
<p>Most data I get tends to provide this: well-put-together NetCDF &amp; HDF files, as well as anything produced by or with a GIS (such as GeoTIFF, Shapefiles, etc.), tend to be alright.  I can work with this.  However, if I have to figure things out, it slows me down.  Sometimes a lot, if I misinterpret it: data that looks reasonable to my eye but actually is not Can Cause Badness.  This is largely the case with Fortran binary files, but I get a lot of plain text that also shares these problems: I have an unclear idea of <em>where</em> or <em>what</em> is being represented.  Examples:</p>
<ol>
<li>Points are addressed from 0 to 360 degrees east instead of -180 to 180 (or vice versa)</li>
<li>There may be a README file with the line &#8220;value is thousands / 20&#8221; or something similarly vague.</li>
</ol>
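<p>The longitude mix-up in the first example is at least mechanical to fix once you know which convention a file uses.  A minimal sketch, assuming longitudes are plain floats in degrees; the function names here are made up for illustration and do not come from any library:</p>

```python
def to_signed_longitude(lon_east):
    """Map degrees east (0 to 360) onto the -180 to 180 convention."""
    # Shift by 180, wrap into [0, 360), then shift back.
    return ((lon_east + 180.0) % 360.0) - 180.0


def to_east_longitude(lon_signed):
    """Map the -180 to 180 convention onto degrees east (0 to 360)."""
    # Python's % returns a non-negative result for a positive divisor.
    return lon_signed % 360.0


if __name__ == "__main__":
    print(to_signed_longitude(270.0))  # 270 degrees east is 90 degrees west
    print(to_east_longitude(-90.0))
```

<p>Note that this only helps when the convention is documented; the code cannot tell you which one a given file uses.</p>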
<li><strong>Oddball Measurements</strong>.  Let&#8217;s start with something basic.  If a value is supposed to be between zero and one, then why do your measurements include numbers both less than zero and greater than one?  Saying &#8220;sometimes in real life, things work out like that&#8221; doesn&#8217;t satisfy me in building a numerical model where the textbook definition of a function expects a value to be between zero and one.  Tell me <em>why</em> this is, with perhaps a textbook reference going more in depth.  I don&#8217;t mind that your measurements fall outside the traditionally accepted bounds of reality.  I mind that there&#8217;s no systematic explanation of why this is.</li>
<li><strong>Oddball time-series</strong>.  In modeling, we want to see a lot of things over time.  The more fine-grained our models become, the more fine-grained we want our measurement data.  If I&#8217;m simulating the growth of a crop, I would like to see measurements taken throughout its growth as well as after it&#8217;s done growing.  It&#8217;s hard to determine a trend if one only has one or two datapoints to work from.</li>
</ul>
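<p>For the out-of-range measurements, one option on the consuming side is to list the offending values so they can be sent back with a question, rather than silently clipping them.  A rough sketch, assuming the measurements arrive as a plain list of floats; the function name and the sample values are hypothetical:</p>

```python
def out_of_bounds(values, lower=0.0, upper=1.0):
    """Return (index, value) pairs for measurements outside [lower, upper]."""
    # "not (lower <= v <= upper)" also flags NaN, since NaN comparisons are False.
    return [(i, v) for i, v in enumerate(values) if not (lower <= v <= upper)]


if __name__ == "__main__":
    sample = [0.2, -0.05, 0.97, 1.1]  # made-up values for illustration
    for index, value in out_of_bounds(sample):
        print(f"datapoint {index} = {value} is outside the expected range")
```

<p>This does not explain anything, of course; it only makes the list of questions for whoever collected the data easier to write down.</p>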
<p>I suppose the grand summary of this griping is that as a modeler, I have my hands full with a lot of things.  Spending weeks trying to work with bad data is frustrating; human hands entered the data into a computer, so why can&#8217;t those same hands <em>explain</em> the data?</p>
]]></content:encoded>
			<wfw:commentRss>http://peawee.net/2010/01/07/damn-dirty-data/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>

