Developing an infrastructure for online publishing and electronic journals
Maintaining points of reference in the digital soup

Peter Flynn
Electronic Publishing Unit, University College Cork Computer Centre
Open Publish 2007, 7-9 March 2007, Baltimore, MD

Abstract

This paper describes the design and implementation of a pilot electronic publishing service in UCC, with particular emphasis on serving electronic journals, project publications, bibliographies, and individual articles.

Background

Origins

The growth of academic networking since the early 1980s has tended to obscure the divide which existed—in the first decade and a half at least—in the funding of network connections and services for the Natural Sciences and for the Humanities. For much of this time, network IT resources were more easily available to disciplines whose natural base was mathematics or the laboratory than to those for whom the Library was the laboratory. In many cases the division was of course not clear-cut, and there were many other factors affecting the provision of resources.

The expansion of access to the Internet and the Web in the early and mid 1990s, coupled with the spread of wordprocessing, made it much harder to ignore the needs of those who had no previous track record of involvement with providing information on the network. This did not prevent several UK universities from refusing email addresses to some Humanities staff—on the grounds that they did not need them—thereby effectively preventing them from participating in lucrative EU-funded research, for which the possession of an institutional email address was regarded as a guarantee of a participant's bona fides.

A decade later, the facilities for informal publication, both institutional and individual, are open to anyone, especially within the academic field, where the infrastructure and support have traditionally been at a more advanced level than within corporations or ISPs. The more recent advances, publicly characterised by Tim O'Reilly as Web 2.0®, are closer to the original ideals of Ted Nelson's hypertext as non-sequential writing and his later view of deep electronic literature; to Tim Berners-Lee's (much later) semantic web; and to the concept of the digital native attributed to Marc Prensky (which was a topic at IUISC 2006). With fluid, immediate, rewritable, republishable, global hypermedia, we have the potential for everyone to publish everything.
However, reference points are still needed in the digital soup, which implies a parallel, more formal approach which can leverage recent techniques but avoid the moving-target syndrome (one of Nelson's recurrent criticisms of the Web).

Requirements

The objective of the electronic publishing service is to provide a platform for the publication of the following classes of documents:

- Online journals published in UCC
- Text-based research projects (to date mostly in the literary, historical, linguistic, and sociological fields, but by no means restricted to those areas)
- Postgraduate theses and theses-in-development
- Conference publications (abstracts, preprints, proceedings, and individual papers)
- The UCC Research Output Bibliographies
- Isolated articles and collections for which no other suitable provision is made
- Online documentation for this and other services of the Unit

Some of these are by their nature static documents (eg transcriptions, finished theses); others need to be updated quite frequently, on a regular or irregular basis (eg conference papers, bibliographies); or their collection may need to be added to on a periodic cycle (eg journals). Many of these could be published informally elsewhere (and may already have been), but this can make them harder to find, and requires a greater foreknowledge of the technology than is reasonable to expect from many users.

In the case of informal publication, it is relatively straightforward to upload documents to a departmental, project, or personal website; and there are numerous packages available (including content management systems) for wrapping collections of documents in a common visual format. Facilities for formal publication are far less common and tend to be in-house corporate systems developed by individual publishers. The principal differences between formal and informal systems are:

Informal publication: An informal system is generic, and relies on the IT skills of the editor or publisher to upload and maintain documents in position, and to ensure that the surrounding navigation and decoration conform to the design and standards of the site. The emphasis is on visual appeal and immediacy, and there is no provision for persistence.

Formal publication: Formal publication is more automated: document formatting is more robust (and necessarily more restrictive); more time is available for real editorial intervention; there is a reliable naming scheme for documents and their URIs; documents are indexable and internally addressable; metadata is available; and publication-quality print-formatted copies are available on demand.

It is important to note that an electronic publishing system—formal or informal—does not absolve the author or editor of the traditional responsibility of ensuring that each document is checked or peer-reviewed for suitability, coherence, grammar, spelling, reading level, accuracy, copyright conformance, and citation. These activities (collectively, content-editing) are human tasks, and no amount of automation can substitute for good-quality editorial control.

For use in the university, especially for electronic journals, a formal system was clearly needed.
The specific features identified were (in no particular order):

- The act of publishing must be as effort-free as possible, requiring only a simple drag-and-drop action.
- Document collection (grouping by journal, volume, issue, etc) must be automatic and require no editorial intervention.
- The URIs created for each document or collection must be short, human-comprehensible, and persistent.
- The amount of file-editing (as distinct from content-editing) must be limited to the minimum required to ensure consistent identification; but this should not preclude more complex structures when needed.
- Presentation must be entirely automatic, invisibly based on the identification of the component parts of a document. Rendering conflicts (eg bulleted numbered lists) should be readily identifiable by the editor before publication.
- The system must maximise the exposure of journals, projects, and particularly UCC authors.
- It must be capable of generating publication-quality print formatting as well as web formatting, and other outputs (eg Braille, audio) at a future stage.
- It must be possible to serve a long document automatically in shorter sections, for ease of use.

Pilot phase

A pilot phase to demonstrate proof-of-concept was needed before any funding could be made available. Additional demands included the hosting of resources such as the published UCC Research Bibliography (taken from the campus InfoEd system), and the provision of more specialist document services for some research projects.

To ensure persistence and supportability, only non-proprietary solutions were considered. Cocoon was chosen over other document-server solutions (AxKit, PropelX) in order to minimise development time, maximise stability and persistence, and avoid the need for third-party binary-only APIs, unmaintainable and undocumented Perl hacks, etc.

The initial approach was to support the DocBook and TEI-Lite markup vocabularies for the pilot articles and documentation. However, as there is no suitable editing software available yet for non-XML-experts, provision was also made for documents authored in a wordprocessor with Named Styles and saved as undistinguished XML (ODF, WordML, or Office Open XML).

The pilot phase has been successful, and the next phase of implementation is being started. The lessons learned from the pilot were invaluable in directing the implementation, as well as the editorial training requirements and the recommendations to authors for the use of Named Styles in wordprocessors. The use and re-use of XSLT throughout enabled relatively complex navigation and formatting requirements to be tested.

Implementation

The Cocoon framework demonstrated itself during the pilot phase as very capable of handling both static and dynamic documents, and of providing an easily maintainable structure for presentation. This platform therefore became the target for the full service, but the most essential requirement was that the documents themselves should continue to be in a persistent format, able to withstand changes in technology without themselves needing to be changed, regardless of the format in which they get served (eg HTML, PDF). Obsolete and legacy file formats such as .doc, .xls, and their earlier equivalents do not fulfil these requirements and are therefore not directly usable, but can nowadays easily be saved as XML.

The current stylesheet was based on the site-wide campus default, but this is scheduled to be changed because the original was not designed with normal text documents in mind.
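To give a concrete picture of how requests are handled in the Cocoon framework described above, the fragment below is a minimal sketch of the kind of sitemap pipeline involved: a request URI is matched against a pattern, the corresponding XML source is read, transformed with XSLT, and serialised as HTML. The directory layout and the stylesheet name docbook-to-html.xsl are illustrative assumptions, not the actual configuration of the UCC service, and the standard Cocoon generator, transformer, and serializer components are assumed to be declared elsewhere.

    <map:sitemap xmlns:map="http://apache.org/cocoon/sitemap/1.0">
      <map:pipelines>
        <map:pipeline>
          <!-- A request for docs/foo.html is answered from content/foo.xml;
               the text captured by the * wildcard is available as {1} -->
          <map:match pattern="docs/*.html">
            <map:generate src="content/{1}.xml"/>
            <map:transform src="stylesheets/docbook-to-html.xsl"/>
            <map:serialize type="html"/>
          </map:match>
        </map:pipeline>
      </map:pipelines>
    </map:sitemap>

Serving the same source as PDF or some other format would then be a matter of adding a further match with a different transformation and serialiser, which is what makes the single persistent XML master so attractive.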
The core behaviour of a document's contents (when served as HTML)—headings, paragraphs, lists, tables, figures, quotations, etc—will automatically adapt itself to the geometry of the browser window and the specifications of whatever CSS styles are in effect, so creating a new web page layout for the default style, or for a specific journal or project, only requires some higher-level changes in the XSLT stylesheet and some additional CSS.

Changes to the core behaviour are also straightforward, but are undertaken much more rarely because of the need to ensure consistency, both of appearance and of usability and accessibility. XSLT recognises only the logically-identifiable component parts of a document that have been specified, so the default is to ignore the large quantity of formatting which wordprocessors and authors add to documents for their own benefit, but which is completely irrelevant for publication. Good examples are the changes in typeface, font, style, and spacing which authors use to make editing easier for themselves, or which wordprocessors add (often arbitrarily or inconsistently) because they are designed to preserve the author's formatting exactly for printing purposes; neither is of any relevance when Named Style markup is used to identify the component parts of the document.

Journals and conference papers

A naming scheme for regularly-recurring classes of documents was devised so that an editor could upload a group of documents and have them appear immediately in the right place. The Cocoon sitemap file (which controls the URIs by which documents are accessed) works on the basis of pattern-matching, so the uploaded documents only need to match the expected pattern, for example journal/issue/author.

The filename pattern seq-author-year-vol-issue.xml provides enough information for the sitemapper to identify the document when asked for the prescribed URI. The seq allows the editor to control the order of the articles in an issue (listed by default when the journal/issue URI is requested). It also makes it very straightforward to get Cocoon to serve alternative views of the journal (by author, for example, or by year) without the need for any additional programming.

Identifying the sections and subsections of long documents makes it possible to generate an automated table of contents, with the section titles acting as links to the text (a sketch of such a stylesheet follows this list). Cocoon can then serve the documents in separately-addressable chunks without the need for user intervention. This has several benefits:

- it reduces the time taken for the page to be opened;
- it means that even very long documents can be hosted without risking browser instability;
- it makes document fragments separately linkable (see below);
- because it uses the ID/IDREF mechanism of XML, links between document sections are transparently preserved even across what appear as separate HTML pages.
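By way of illustration, the following is a minimal sketch of a stylesheet fragment that could generate such a table of contents. It assumes DocBook-style article/section/title markup with id attributes; the element names and the toc mode are illustrative and may differ in detail from the stylesheets actually in use.

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <!-- Build a linked table of contents from the top-level sections -->
      <xsl:template match="/article" mode="toc">
        <ul class="toc">
          <xsl:for-each select="section">
            <li>
              <!-- The section's own id is used as the anchor, so the same
                   link works whether the section is served inline in the
                   full document or as a separate chunk -->
              <a href="#{@id}">
                <xsl:value-of select="title"/>
              </a>
            </li>
          </xsl:for-each>
        </ul>
      </xsl:template>
    </xsl:stylesheet>

Because each section carries its own id, a link from one chunk to another can be rewritten into an ordinary hyperlink when the chunks are served as separate pages, which is how the cross-references described above survive the chunking.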
Project documents and theses

Projects tend to generate large quantities of ancillary documentation—outlines, descriptions, specifications, bids for funding, periodic reports, interim results, etc. These can all be given a consistent appearance, and listed in suitable groupings (by document type, year, etc), using a method similar to that outlined for journals.

Current policy is to provide postgraduate students with web space for a home page for their research, so that they can publicly stake their claim to their field and publish whatever interim material they wish (if any), as well as making their final thesis available for download at a persistent URI. Moving this service to the electronic publishing framework means the usual assortment of mailing lists, blogs, wikis, surveys, etc can also be provided, and their publishable material can be handled with the same technology as other documents.

Bibliographies

The UCC InfoEd system provides a managed environment for maintaining research metadata (funding, participants, duration, CVs, publications, etc), but the bibliographic data facilities are very poorly designed and not easily accessible. A periodic extract of the data has been tested, and can be both aggregated to departmental level and categorised by type of publication per author, so that an efficient online bibliography is available for every researcher on campus. Individuals continue to maintain their data in InfoEd, but an automated weekly extract can be served via Cocoon in the standard UCC web style. The current test has identified a number of serious flaws in the way in which InfoEd stores the data, and the finished implementation must wait until the current version of InfoEd is updated.

In generating the references, the internal structure of the bibliographic data is used to provide a download of individual citations in a form suitable for direct inclusion in a user's personal bibliographic database (eg Reference Manager, EndNote, ProCite, JabRef, etc). This makes it very easy for anyone to cite works by a UCC author in the correct format in their own writings. The same technique is used in the journal and conference formats, but for the published papers only, not for the references within them, unless the paper is marked up using a format which can be used analytically, such as TEI or DocBook.
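As an indication of what such a citation download involves, the fragment below sketches an XSLT template that emits a single publication record in RIS format, which the bibliographic managers listed above can import. The record and field element names (record, author, title, year) are hypothetical; the real InfoEd extract schema is not described in this paper.

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="text"/>
      <!-- Emit one RIS entry per (hypothetical) publication record -->
      <xsl:template match="record">
        <xsl:text>TY  - JOUR&#10;</xsl:text>
        <xsl:text>AU  - </xsl:text>
        <xsl:value-of select="author"/>
        <xsl:text>&#10;TI  - </xsl:text>
        <xsl:value-of select="title"/>
        <xsl:text>&#10;PY  - </xsl:text>
        <xsl:value-of select="year"/>
        <xsl:text>&#10;ER  -&#10;</xsl:text>
      </xsl:template>
    </xsl:stylesheet>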
Documentation and isolated documents

There is occasional demand, both from the academic and the administrative communities, for a place to put an individual document which is outside the scope of the design of the main campus web site, or which has some special applicability.

The Unit maintains a growing quantity of documentation about its work and about the web. Because of the closely-related nature of these documents, they are more easily maintained in XML than anything else, so this server is a natural place to keep and serve them.

One of the major benefits (referred to above) is the ability to serve documents in chunks which can be linked to separately, even though the document is a much larger entity, without the need to download the whole document or to scroll through it to find the reference.

Conclusions

The success of the pilot phase has already drawn attention from half a dozen departments and projects in the College who either have documents they want to make available online or are planning to start an online journal.

None of this would have been viable if the major wordprocessors had not chosen to provide XML as a save format. Despite the horrendous complexity of what has been described as memory dumps [of the internal data structures] with angle brackets around them (ODF and OOXML), it is not impossible to extract the relevant identifiable text and present it on the web or format it for print. The non-interventionist policy of using Named Styles means authors and editors can simply upload the documents they want published and see them appear in the right place. (There is in fact a preview period before indexing, in order to give uploaders a chance to withdraw and edit a document if a problem is discovered at the last moment.)

The project has also been the trigger for some of the developments in web formatting, based on the requirements of some of the College's existing projects now moving their service data into an interactive XML environment:

- the implementation of the CSS LightBox technique (due to Lokesh Dhakar) to display epexegetic information like notes, references, and variant readings transiently, without the need to open a new window;
- the provision of selectable downloads of bibliographic data in reusable format;
- the algorithm for positioning footnotes in a medium which is pageless (browser HTML);
- the use of XSLT and associated tools to provide publication-formatted PDF copies of documents in real time (as an alternative to the time-consuming XSL:FO methods).

References

Lokesh Dhakar. Lightbox JS v2.0. http://www.huddletogether.com/projects/lightbox2/

Håkon Wium Lie. "Microsoft's amusing standards stance". c|net News Perspectives. http://news.com.com/Microsofts+standards+choice/2010-1013_3-6161285.html

Tim O'Reilly. "What Is Web 2.0: Design Patterns and Business Models for the Next Generation of Software". O'Reilly Net. http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html

Ted Nelson. Literary Machines, edition 87.1. Sausalito Press, Swarthmore, PA.

Tim Berners-Lee, James Hendler, and Ora Lassila. "The Semantic Web: A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities". Scientific American. http://www.sciam.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21

Marc Prensky. "Digital Natives, Digital Immigrants". On the Horizon 9(5). http://www.marcprensky.com/writing/Prensky%20-%20Digital%20Natives,%20Digital%20Immigrants%20-%20Part1.pdf

ISO/IEC JTC 1/SC 34. Information technology—Open Document Format for Office Applications (OpenDocument) v1.0. ISO/IEC 26300:2006. International Organization for Standardization, Geneva.