To most people, the term “big data” is associated with the data that corporations collect about us through our Facebook posts, the GPS on our smartphones, and the products we scan at the checkout. It’s not just Big Data, it’s Big Brother. That is one very pervasive example, but “big data” basically refers to all the huge volumes of data generated every day that we don’t currently have the capacity to fully use. I found this video quite helpful and accessible for explaining these ideas:
The mind boggles at how much data is collected using high-tech telescopes and satellites. I looked at the Australian Square Kilometre Array Pathfinder (ASKAP) telescope, which produces 2.5 gigabytes of data per second, or 75 petabytes per year. I was impressed by the collaboration between different organisations to store, analyse, manage and publish the data. Without that high level of collaboration, it wouldn’t be possible.
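For anyone who likes to check the maths, here is a quick back-of-envelope sketch of how a per-second rate turns into a yearly total. It assumes the telescope runs non-stop all year and uses decimal units (1 GB = 10^9 bytes, 1 PB = 10^15 bytes); both are my assumptions, not anything stated about ASKAP. The function name is made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds in a non-leap year


def petabytes_per_year(gigabytes_per_second: float) -> float:
    """Convert a steady data rate in GB/s to PB/year (decimal units)."""
    bytes_per_year = gigabytes_per_second * 1e9 * SECONDS_PER_YEAR
    return bytes_per_year / 1e15


if __name__ == "__main__":
    # Assumes continuous, year-round operation at 2.5 GB/s.
    print(round(petabytes_per_year(2.5), 1))
```

Under those assumptions the result comes out a little under 79 petabytes, which is in the same ballpark as the quoted 75 petabytes per year.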
Here’s the long-awaited (ha!) second installment of my Things I Wish I Knew Earlier series. It’s for new things I discover that leave me cursing the fact that I didn’t know about them earlier.
So just in case you’re like me and you have somehow missed the following fact most of your life… Microsoft Word allows you to sort stuff in alphabetical, numerical or date order.
It’s the friendly little A/Z↓ button in the Paragraph section of the Home ribbon.
It’s a good function if you’re compiling a reference list and not using any reference management software (RIP you). Or any list that you want sorted in alphabetical order. It doesn’t matter what order you add items to your list – you can sort the text alphabetically at the end of the process.
It also works for tables. If you need to re-order your rows alphabetically, numerically or by date, you can do it. It’s very similar to Excel’s sorting options, but in Word. Who knew?
So stop cutting and pasting and turning your document into a big ol’ mess trying to re-order stuff manually. Word knows how to sort! Hooray!
There’s a great story about an influential economics paper, “Growth in a Time of Debt” by Reinhart & Rogoff, where the dataset had a simple calculation error that was busted by a student when he tried to replicate the study for an assignment. Here’s an article about it that makes for good reading: http://www.bbc.com/news/magazine-22223190. My economist brother told me about it a while ago, and it instantly sprang to mind.
This incident also appears in the European Spreadsheet Risks Interest Group’s list of horror stories (it’s the second entry): http://www.eusprig.org/horror-stories.htm. The Excel formula used for the calculation that was key to the researchers’ argument was AVERAGE(L30:L44) when it should have been AVERAGE(L30:L49) – leaving off 5 rows of data and skewing the results. Oops!
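To see how much damage a too-short range can do, here is a small sketch that mimics the two formulas. The numbers are completely made up for illustration (they are NOT the real Reinhart & Rogoff figures), and `excel_average` is a hypothetical helper I wrote to imitate Excel’s AVERAGE over an inclusive range of rows.

```python
from statistics import mean

# Hypothetical values standing in for cells L30 to L49 (20 rows).
# Invented for illustration only; not the data from the paper.
values = [2.0] * 15 + [4.0] * 5
cells = dict(zip(range(30, 50), values))  # row number -> cell value


def excel_average(cells: dict, first_row: int, last_row: int) -> float:
    """Imitate AVERAGE(L<first_row>:L<last_row>), inclusive of both ends."""
    return mean(cells[row] for row in range(first_row, last_row + 1))


truncated = excel_average(cells, 30, 44)  # the range that stops 5 rows early
complete = excel_average(cells, 30, 49)   # the range that covers every row

if __name__ == "__main__":
    print("AVERAGE(L30:L44):", truncated)
    print("AVERAGE(L30:L49):", complete)
```

With this invented data the truncated range averages 2.0 and the full range averages 2.5, and nothing in the spreadsheet would flag the difference.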
But it’s so easy to make that kind of mistake! The cynic in me wonders if resistance to open data is at least partially influenced by researchers being worried they’ll have made a similar type of mistake somewhere and have it found out. If it can happen to a couple of Harvard professors, it can happen to anyone!
My Excel-Fu is pretty good, but in the past I’ve made mistakes like re-sorting the data in just one column in a table (meaning to re-sort the whole table of course) while all the other columns stayed in their original order. That meant that I had totally mixed up the entries in that column and had no way to know which row each cell should have belonged to. The data wasn’t “dirty” so much as completely destroyed! My mistake didn’t stand out to me until I had done some more work on the table and realised something was not right with that column. Even the “Undo” button couldn’t help me – I had to restore a previous version, undoing quite a bit of work. It’s one of those things where you just need to learn the proper method before you try anything.
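The same trap exists outside Excel. Here is a minimal sketch, using a made-up three-row table of names and ages, that shows the difference between sorting one column on its own and sorting whole rows together.

```python
# Hypothetical table: each tuple is one row (name, age).
rows = [("Carol", 41), ("Alice", 29), ("Bob", 35)]

# The mistake: sort the name column by itself and leave the ages where they were.
names_only_sorted = sorted(name for name, _ in rows)
ages_untouched = [age for _, age in rows]
broken = list(zip(names_only_sorted, ages_untouched))

# The proper method: sort complete rows, using the name as the sort key,
# so every age travels with its own name.
fixed = sorted(rows, key=lambda row: row[0])

if __name__ == "__main__":
    print("broken:", broken)
    print("fixed: ", fixed)
```

In the broken version Alice ends up with Carol’s age, and once the original order is gone there is no way to tell from the table alone.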
I liked learning about OpenRefine, and was impressed by its capability for cleaning up inconsistent cells. You can use Excel’s Filter options to clean up inconsistencies, typos, or duplicates (or if you’re entering data into a spreadsheet, you can use the data validation tool to control what values can be put in each cell to prevent it becoming messy), but the process is more manual and OpenRefine’s functions make it a lot easier.
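To give a feel for how this kind of clean-up can work, here is a simplified sketch of “fingerprint” style clustering, where values that differ only in case, spacing, punctuation or word order get grouped together. It is my own rough approximation of the idea, not OpenRefine’s actual code, and the function names and sample values are made up.

```python
import string
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Reduce a value to a normalised key: trimmed, lower-case,
    punctuation removed, words de-duplicated and sorted."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values sharing a fingerprint; keep only groups with variants."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


if __name__ == "__main__":
    messy = ["New York", "new york ", "York, New", "Boston"]
    print(cluster(messy))
```

Each cluster is a candidate for merging into one agreed spelling; a person still needs to decide which variant is the right one.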
I think what makes geospatial visualisations powerful is that they can overlay different kinds of information, which lets you see connections. For example, the PetaJakarta project overlays a map of different neighbourhoods, river networks, geotags from Twitter, colour-coded areas of high activity, etc. This brings different data together to form the big picture and identify situations where these factors cross – it’s a quick and effective way to see needs and respond to them.
If 80% of all research data has a geographic or spatial component, it’s important to be able to search datasets by geography! I love the spatial search interface and being able to draw a box over a physical area on the map. And once you run the search, you can see that it translates to a range of co-ordinates. Cool stuff! I think this could definitely lead to new research questions – it is an obvious way to identify ‘gaps’ in a subject area where a physical location hasn’t been studied yet.
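Here is a rough sketch of what a box drawn on a map boils down to: a range of longitudes and a range of latitudes, which can then be used to filter records. The `BoundingBox` class, the dataset names and the approximate coordinates are all made up for illustration, and this simple version ignores boxes that cross the 180° meridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """A rectangle on the map, in decimal degrees."""
    west: float
    south: float
    east: float
    north: float

    def contains(self, lon: float, lat: float) -> bool:
        """True if the point falls inside the box (edges included)."""
        return self.west <= lon <= self.east and self.south <= lat <= self.north


# Hypothetical datasets, each tagged with an approximate (longitude, latitude).
datasets = {
    "Sydney survey": (151.21, -33.87),
    "Perth survey": (115.86, -31.95),
    "Jakarta survey": (106.85, -6.21),
}

# A box drawn roughly over south-eastern Australia.
box = BoundingBox(west=140.0, south=-40.0, east=155.0, north=-25.0)
matches = [name for name, (lon, lat) in datasets.items() if box.contains(lon, lat)]

if __name__ == "__main__":
    print(matches)
```

Turning the question around (which boxes return no matches at all?) is one way the “gaps” mentioned above could be spotted.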
I was impressed with the Trove Application Gallery. I think it’s neat that Trove put their API out there for free, and there are some pretty cool projects that people have made with it. It makes data more ‘fun’ and usable for different audiences.
I really enjoyed reading the “8 apps that turn citizens into scientists” article. I think the idea of the citizen scientist is great, and it’s a great way to leverage the pervasive technology of smartphones to collect a large amount of data. But omg, number 5… people-watchers and drama-seekers should get a kick out of that one, and can rest assured they’ll be contributing to scientific research in the process! 👀