In the uni library world we encourage students to use the library’s discovery layers and database interfaces to search for information. We tell first-years over and over again not to use Google. Is this the right thing to do?
Here’s an information retrieval story from today when I was looking for journal articles on a certain topic:
1. Used library discovery layer. Didn’t like my results.
2. Used most recommended database #1, which is small but specialised. Simple 2-term search connected by AND. Some success: not exactly a jackpot of relevant articles, but a few that were of interest.
3. Added some synonyms to my search strategy to broaden. Results were the same.
4. Used most recommended database #2. Large, but multidisciplinary. Had to add more terms to refine the search and experiment with my keywords a bit more. Not much success.
5. Went back to discovery layer and fiddled with my keywords a bit more. Still not satisfied.
6. Went to Google. Put in my 2 keywords, no syntax or synonyms or anything. Based on the nature of my search terms, suggestions from Google Scholar appeared at the top of the results list. It suggested two articles that were basically my ~dream articles~ in terms of relevance, and highly cited.
7. Clicked on them. Paywall.
8. Copied and pasted the article titles into library search. There they were, in databases I hadn’t looked at in steps 2 or 4. Clicked through to full text, and downloaded the pdfs in all their glory.
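For anyone who hasn’t built a database search string before, here is a rough sketch of what the simple AND search and the broadened synonym search look like. The search terms are made-up placeholders (not my actual topic), and `build_query` is just an illustrative helper, not something any database provides.

```python
def build_query(concepts):
    """Join synonyms with OR inside each concept, then join concepts with AND."""
    groups = []
    for synonyms in concepts:
        # Quote multi-word phrases so they are searched as a phrase
        terms = [f'"{t}"' if " " in t else t for t in synonyms]
        group = " OR ".join(terms)
        if len(terms) > 1:
            group = f"({group})"  # brackets keep the ORs together
        groups.append(group)
    return " AND ".join(groups)

# Hypothetical terms, for illustration only
narrow = build_query([["metadata"], ["repositories"]])
broad = build_query([
    ["metadata", "cataloguing"],
    ["repositories", "digital libraries"],
])

print(narrow)
print(broad)
```

The exact syntax (quotes, brackets, operator names) varies between databases, so treat this as the general shape rather than something to paste in as-is.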
Now, Google was by far the most helpful tool in terms of discovery (step 6). It was very simple. I didn’t even intentionally go to Google Scholar, just plain old Google. The Scholar results were presented right in my face, with no effort on my part. But in terms of access, Google let me down (step 7). I happened to know there was a good chance that the library would have access, so I went looking there once I had the article titles.
The databases and library search tool were not as good at discovery. It was trickier and more frustrating to find stuff: I had to use some advanced search strategies, and common techniques like broadening my terms didn’t always work (steps 2-5). The few relevant results that I did get were not as good as what I later found on Google. Of course this is not always my experience. It depends on the amount of literature available on your topic, how well your search terms match the vocabulary of the databases and the literature, and endless other factors. My topic happened to be a little bit niche on this occasion, which is the kind of situation when I think Google provides a better search experience. The real value in the library search and library-subscribed databases was the access itself (step 8).
I know this is just one anecdote. But c’mon, Google can be a life saver sometimes. Let’s not demonise it. Maybe next time a student is stuck in a rut not finding relevant information in databases, rather than complicating the search strategy, we could just suggest the Google workaround!
Woohoo, I made it!
I had a lot of fun going through the program, and learned a lot about research data that I can apply to my work and share with my colleagues.
My favourites were anything to do with metadata. I also had a good opportunity for reflection and discussion being part of the catch-up webinar in June. I’ve already reflected in my blog on this.
I also used pages on the ANDS website, and other resources linked to in the “things”, throughout my uni subject this semester on research data management. So the program has been immediately helpful for that. I’m also going to plug the program to my colleagues.
I’ll be keeping an eye out for more opportunities to do fun data stuff, especially getting into some of the technical skills that I didn’t do the first time around.
To most people the term “big data” is associated with the data that corporations collect about us through our Facebook posts, the GPS on our smartphones, and what products we scan at the checkout. It’s not just Big Data, it’s Big Brother. That is one very pervasive example of it, but “big data” basically refers to all the huge volumes of data that are generated every day, that we currently don’t have the capacity to fully use. I found this video quite helpful and accessible for explaining these ideas:
The mind boggles at how much data is collected using high-tech telescopes and satellites. I looked at the Australian Square Kilometre Array Pathfinder (ASKAP) telescope – which produces 2.5 gigabytes of data per second, or 75 petabytes per year. I was impressed at the collaboration between different organisations to store, analyse, manage and publish the data. Without that high level of collaboration, it wouldn’t be possible.
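Out of curiosity, here is a quick sanity check of those numbers. It assumes decimal units (1 petabyte = 1,000,000 gigabytes) and a telescope running non-stop all year, both of which are my assumptions rather than anything stated in the source.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # 31,536,000 seconds in a non-leap year
GB_PER_PB = 1_000_000                   # decimal units (assumption)

rate_gb_per_second = 2.5                # figure quoted for ASKAP above

gb_per_year = rate_gb_per_second * SECONDS_PER_YEAR
pb_per_year = gb_per_year / GB_PER_PB

print(f"{pb_per_year:.2f} PB per year")
```

That works out to a bit under 79 petabytes for a full year of continuous operation, so the quoted 75 petabytes is in the same ballpark.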
Here’s the long awaited (ha!) second installment of my Things I Wish I Knew Earlier series. It’s for new things I discover that leave me cursing the fact that I didn’t know about them earlier.
So just in case you’re like me and you have somehow missed the following fact most of your life… Microsoft Word allows you to sort stuff in alphabetical, numerical or date order.
It’s the friendly little A/Z↓ button in the Paragraph section of the Home ribbon.
It’s a good function if you’re compiling a reference list and not using any reference management software (RIP you). Or any list that you want sorted in alphabetical order. It doesn’t matter what order you add items to your list – you can sort the text alphabetically at the end of the process.
Also works for tables. If you need to re-order your rows, alphabetically, numerically or by date, you can do it. It’s very similar to Excel’s sorting options, but in Word. Who knew?
So stop cutting and pasting and turning your document into a big ol’ mess trying to re-order stuff manually. Word knows how to sort! Hooray!
There’s a great story about an influential economics paper, “Growth in a Time of Debt” by Reinhart & Rogoff, where the dataset had a simple calculation error that was busted by a student when he tried to replicate the study for an assignment. Here’s an article about it that makes for good reading: http://www.bbc.com/news/magazine-22223190.
This incident also appears in the European Spreadsheet Risks Interest Group’s list of horror stories (it’s the second entry): http://www.eusprig.org/horror-stories.htm. The Excel formula used for the calculation that was key to the researchers’ argument was AVERAGE(L30:L44) when it should have been AVERAGE(L30:L49) – leaving off 5 rows of data and skewing the results. Oops!
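To see how much a truncated range can shift an average, here is a small sketch. The values are placeholders I made up (the numbers 1 to 20 in rows 30 to 49), not the figures from the real spreadsheet, and `average` is just a stand-in for Excel’s AVERAGE over a row range.

```python
from statistics import mean

# Placeholder values only: NOT the figures from the real spreadsheet.
# Rows 30..49 of a pretend column L hold 1.0, 2.0, ... 20.0.
column_l = {row: float(row - 29) for row in range(30, 50)}

def average(cells, first_row, last_row):
    """Mimic Excel's AVERAGE over an inclusive row range."""
    return mean(cells[r] for r in range(first_row, last_row + 1))

truncated = average(column_l, 30, 44)   # like AVERAGE(L30:L44): 15 rows
full = average(column_l, 30, 49)        # like AVERAGE(L30:L49): 20 rows
rows_missed = 49 - 44

print(truncated, full, rows_missed)
```

With these made-up values the truncated range gives 8.0 and the full range gives 10.5. How big the gap is in practice depends entirely on what sits in the rows that got left out.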
But it’s so easy to make that kind of mistake! The cynic in me wonders if resistance to open data is at least partially influenced by researchers being worried they’ll have made a similar type of mistake somewhere and have it found out. If it can happen to a couple of Harvard professors, it can happen to anyone!
My Excel-Fu is pretty good, but in the past I’ve made mistakes like re-sorting the data in just one column in a table (meaning to re-sort the whole table of course) while all the other columns stayed in their original order. That meant that I had totally mixed up entries in that column and had no way to know which row each cell should have belonged to. The data wasn’t “dirty” so much as completely destroyed! My mistake didn’t stand out to me until I had done some more work on the table and realised something was not right with that column. Even the “Undo” button couldn’t help me – I had to restore to a previous version, undoing quite a bit of work. It’s one of those things where you just need to learn the proper method before you try anything.
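Here is a tiny illustration of that single-column sort mistake, using a made-up three-row table of names and scores. It is only a sketch of the idea in plain Python, not what Excel does internally.

```python
# Hypothetical table: each tuple is one row (name, score)
rows = [("Chen", 72), ("Adams", 95), ("Baker", 58)]

names = [name for name, _ in rows]
scores = [score for _, score in rows]

# The mistake: sort only the name column and leave the scores where they were
broken = list(zip(sorted(names), scores))

# The proper method: sort whole rows, so each score travels with its name
fixed = sorted(rows, key=lambda row: row[0])

print("broken:", broken)
print("fixed: ", fixed)
```

In the broken version every name is still there and every score is still there, which is exactly why the damage is so hard to spot: nothing looks missing, the pairs are just wrong.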
I liked learning about OpenRefine, and was impressed at its capability for cleaning up inconsistent cells. You can use Excel’s Filter options to clean up inconsistencies, typos, or duplicates (or if you’re entering data into a spreadsheet, you can use the data validation tool to control what values can be put in each cell to prevent it becoming messy), but the process is more manual and OpenRefine’s functions make it a lot easier.
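To give a feel for how inconsistent cells can be grouped automatically, here is a simplified sketch in the spirit of “key collision” clustering: boil each value down to a normalised key, then group values that share a key. This is my own rough approximation with made-up cell values, not OpenRefine’s actual code, and the function names are hypothetical.

```python
import string
from collections import defaultdict

def fingerprint(value):
    """Normalise a cell: trim, lowercase, drop punctuation, sort unique words."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)

def cluster(values):
    """Group values sharing a fingerprint; keep only groups with variants."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]

# Hypothetical messy column
cells = ["Jane Smith", "jane smith ", "Smith, Jane", "John Citizen"]
clusters = cluster(cells)

print(clusters)
```

A person still needs to look at each cluster and decide which spelling to keep, since values that share a key are not always the same thing.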