My arrival at THATCamp Virginia on Friday will end a week of jaunting around western Virginia doing historical research at a variety of archives and libraries for a new book project (hooray for sabbatical!). Between this and a previous archival trip, for this project alone I will have accumulated several thousand .jpg images of archival documents, along with webpage captures, article PDFs, and notes in various formats (.doc, .xls, and plain old notebooks). I’m already a Zotero user, so this session could go in the direction of getting the most out of that software for research management, but it’s obviously not a very convenient tool for dealing with this giant cache of images. Ideally, I need a way to make images of documents (most of them typewritten) text-searchable without transcribing each one. Since that’s utterly utopian, I’d settle for a good method of converting photo batches into a single PDF file (like the time I had to photograph all 100+ pages of an old government report instead of being able to scan it), and/or some way of attaching robust metadata to image files to allow for search and retrieval (currently I’m making a simple Excel index of my image files, but I’m sure there’s a better way).
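For the batch-to-PDF problem, at least, a few lines of scripting will do the job. Here’s a minimal sketch in Python using the Pillow imaging library (the folder and file names are just placeholders for illustration):

```python
from pathlib import Path
from PIL import Image  # pip install Pillow

# Placeholder folder of numbered JPEGs, photographed in page order
pages = sorted(Path("govt_report_photos").glob("*.jpg"))
images = [Image.open(p).convert("RGB") for p in pages]  # PDF pages need RGB

# Save the first image as a PDF, appending the rest as additional pages
images[0].save("govt_report.pdf", save_all=True, append_images=images[1:])
```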
I would welcome a session sharing ideas, suggestions, tools and hacks for keeping track of (tagging? coding? databases?), searching across, and wrangling the unwieldy collection of digital ephemera that we create in this new era of web-and-gadget-based research and writing. I’m sure I’m not the only one wrestling this beast!
Update Fri 4/10 8:30pm
This session ran on Friday afternoon 4-5, and I promised to make my notes public. Thanks to everyone who participated and gave suggestions; I sincerely apologize that I didn’t think to send around a sign-in sheet so I could attribute the ideas to the people who offered them! GREAT conversation!
Main Ideas
Historians generate images during research as “slavish reproductions” (the legal term) of original artifacts. Whether the original item is under copyright or not, the historian owns the image, and in particular owns any metadata s/he creates associated with that image and its contents. The key to keep in mind when COLLECTING archival images is to be meticulous about documenting where the original item lives, so that you can generate an authoritative citation to it in the future. There are lots of ways to do this well, including keeping a careful list of each image, reproducing the archive’s box-folder-series hierarchy with your own folders, or renaming the images with a standard code to indicate archive-collection-box-folder.
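If you go the renaming route, it’s easy to script. Here’s one hypothetical sketch in Python, assuming each camera-dump folder corresponds to a single archival folder photographed in order (the coding scheme and paths are invented for illustration):

```python
from pathlib import Path

# Hypothetical scheme: ARCHIVE-COLLECTION-BOX-FOLDER, e.g. Library of
# Virginia, Collection 24, Box 3, Folder 7
code = "LVA-C24-B3-F7"
folder = Path("camera_uploads")  # placeholder for one day's camera dump

# Rename the photos in shooting order, keeping their sequence as item numbers
for i, img in enumerate(sorted(folder.glob("*.jpg")), start=1):
    img.rename(folder / f"{code}-{i:03d}.jpg")  # e.g. LVA-C24-B3-F7-012.jpg
```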
It’s also critical to distinguish between the CONTENT (which includes the image, and the stuff on the image rendered into text if possible) and the METADATA for each item. Content and Metadata are different, but should be linked with a unique identifier. Dublin Core is the archival standard for metadata fields and categories, but it’s not the only possibility.
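As a concrete example, a bare-bones metadata index keyed to image files by identifier might look like this in Python, using a handful of Dublin Core-style fields (all of the values below are invented):

```python
import csv

# One Dublin Core-style row per image, linked to the image file by the
# same unique identifier used in its filename (values are invented examples)
records = [
    {
        "identifier": "LVA-C24-B3-F7-012",
        "title": "Letter on school closures",
        "creator": "Boyle, Sarah Patton",
        "date": "1959",
        "source": "Library of Virginia, Collection 24, Box 3, Folder 7",
    },
]

with open("metadata.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=records[0].keys())
    writer.writeheader()
    writer.writerows(records)
```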
Specific Tools & Suggestions
Text contained in images, especially typescript, can be extracted using OCR. Results vary, of course, and multi-column newspapers or pale mimeographs might be problematic (if you’re working with mid-20th-century sources like mine), but it can be a start. Recommended: Tesseract, Adobe Acrobat, Evernote.
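A minimal Tesseract sketch, via the pytesseract Python wrapper, could batch-OCR a folder of images into sidecar text files for later searching (the folder name is a placeholder, and it assumes the Tesseract binary itself is installed):

```python
from pathlib import Path
from PIL import Image
import pytesseract  # pip install pytesseract; also requires Tesseract itself

# OCR each archival image and write the text alongside it as a .txt file,
# so the extracted text can be grepped or indexed later
for img in sorted(Path("archive_images").glob("*.jpg")):
    text = pytesseract.image_to_string(Image.open(img))
    img.with_suffix(".txt").write_text(text)
```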
For generating robust metadata associated with images, we agreed that for small-scale projects this really does have to be done by hand; the possibilities for automation are limited, but some ideas included DEVONthink, LibraryThing / LibraryWorld, Omeka, Microsoft Access, and even Flickr or Tumblr. One good suggestion on tags comes from Shane Landrum @cliotropic (who wasn’t even there, but whose brain I had picked on this yesterday): adapt library MARC-style tags to suit your specific project. You’d just need a metadata or content management program that will accept punctuation in the tagging field (see the toy search sketch after the list below). For example, tags that might work for my project on school closures during Virginia’s massive resistance era:
SU:massive resistance (SUbject)
YR:1959 (YeaR)
ST:VA (STate)
AU:Boyle, Sarah Patton (AUthor)
PL:Warren County (PLace)
SCH:Prince Edward Academy (SCHool)
ORG:NAACP (ORGanization)
DN:SBC (DeNomination)
…etc
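As a proof of concept, even a simple tab-separated index of such tags is searchable with a few lines of Python (the index format and file names here are invented):

```python
# Hypothetical tags.txt index, one tab-separated line per image, e.g.:
#   LVA-C24-B3-F7-012.jpg <TAB> SU:massive resistance; YR:1959; ST:VA
query = "YR:1959"

with open("tags.txt") as f:
    for line in f:
        filename, _, tags = line.strip().partition("\t")
        if query in tags.split("; "):
            print(filename)  # prints every image carrying the query tag
```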
Other Issues
–ideas on OCR for other kinds of files, like handwritten sheet music, manuscripts in longhand, or non-English-language materials? For those, see Kurt Luther’s session ideas on crowdsourcing… some of that work might be something MTurk workers could help with
–what is the threshold for automating / writing scripts / crowdsourcing these tasks vs. the valuable intellectual work of doing them by hand oneself?
–thinking ahead about whether sharing one’s scholarly collection of research images might be something to consider – and what that might mean for database construction up front / early on
–the questions you’ll be asking of your data to some extent drive the form your research database will take, but that is a dynamic & evolving thing, because you may find there are some insights you will discover ONLY because your research data is digital, categorized, searchable, and capable of being manipulated with software. That’s a happy thought!
–What did I overlook? Make additions to my notes in the comments!