Developing an infrastructure for online publishing and electronic journals

Maintaining points of reference in the digital soup

Peter Flynn
University College Cork Computer Centre, Electronic Publishing Unit

Open Publish 2007, 7-9 March 2007, Baltimore, MD

Abstract

This paper describes the design and implementation of a pilot electronic publishing service in UCC, with particular emphasis on serving electronic journals, project publications, bibliographies, and individual articles.

Background

Origins

The growth of academic networking since the early 1980s
has tended to obscure the divide which existed—in the
first decade and a half at least—in the funding of
network connections and services for the Natural Sciences and
for the Humanities. For much of this time, network IT
resources were more easily available to disciplines whose
natural base was mathematics or the laboratory, than to those
for whom the Library was the laboratory. In many cases the
division was of course not clear-cut, and there were many
other factors affecting the provision of resources.

The expansion of access to the Internet and the Web in the
early and mid 1990s, coupled with the expansion of access to
wordprocessing, made it much harder to ignore the needs of
those who had no previous track record of involvement with
providing information on the network.

This did not prevent several UK universities from
refusing email addresses to some Humanities staff—on
the grounds that they didn't need
it—thereby effectively preventing them
from participating in lucrative EU-funded research, for
which the possession of an institutional email address was
regarded as a guarantee of a participant's
bona fides.

A decade later, the facilities for informal publication,
both institutional and individual, are open to anyone,
especially within the academic field, where the infrastructure
and support has traditionally been at a more advanced level
than within corporations or ISPs. The more recent advances,
publicly characterised by Tim O'Reilly as Web 2.0®,
are closer to the original ideals of Ted Nelson's hypertext as
non-sequential writing and his later view of deep
electronic literature; and (much later) to Tim
Berners-Lee's semantic web; and to the concept of the digital
native attributed to Marc Prensky (which was a topic at IUISC 2006). With
fluid, immediate, rewritable, republishable, global hypermedia
we have the potential for everyone to publish everything.
However, reference points are still needed in the digital
soup, which implies that a parallel, more formal approach is
still needed which can leverage recent techniques but avoid
the moving target syndrome (one of Nelson's
recurrent criticisms of the Web).

Requirements

The objective of the electronic publishing service is to
provide a platform for the publication of the following
classes of documents:

- Online journals published in UCC
- Text-based research projects (to date mostly in the literary, historical, linguistic, and sociological fields, but by no means restricted to those areas)
- Postgraduate theses and theses-in-development
- Conference publications (abstracts, preprints, proceedings, and individual papers)
- The UCC Research Output Bibliographies
- Isolated articles and collections for which no other suitable provision is made
- Online documentation for this and other services of the Unit

Some of these are by their nature static documents (eg
transcriptions,
finished theses); others need to be updated quite frequently
on a regular or irregular basis (eg conference papers,
bibliographies); or their collection may need to be added to
on a periodic cycle (eg journals). Many of these could be
published informally elsewhere (and may already have been),
but this can make them harder to find, and requires a greater
foreknowledge of the technology than is reasonable to expect
from many users.

In the case of informal publication, it is relatively
straightforward to upload documents to a departmental,
project, or personal website; and there are numerous packages
available (including content management systems) for wrapping
collections of documents in a common visual format.

Facilities for formal publication are far less common and
tend to be in-house corporate systems developed by individual
publishers. The principal differences between formal and
informal systems are:

Informal publication
An informal system is generic and relies on the IT skills of the editor or publisher to upload and maintain documents in position, and to ensure that the surrounding navigation and decoration conform to the design and standards of the site. The emphasis is on visual appeal and immediacy, and there is no provision for persistence.

Formal publication
Formal publication is more automated: document
formatting is more robust (and necessarily more
restrictive); more time is available for
real editorial intervention;
there is a reliable naming scheme for documents and
their URIs; documents are indexable and internally
addressable; metadata is available; and
publication-quality print-formatted copies are available
on demand.

It is important to note that an electronic publishing
system—formal or informal—does not absolve the
author or editor of the traditional responsibility of ensuring
that each document is checked or peer-reviewed for
suitability, coherence, grammar, spelling, reading level,
accuracy, copyright conformance, and citation. These
activities, known as content-editing, are human activities,
and no amount of automation can substitute for good-quality
editorial control.

For use in the university, especially for electronic
journals, a formal system was clearly needed. The specific
features identified were (in no particular order):

- The act of publishing must be as effort-free as possible, requiring only a simple drag-and-drop action.
- Document collection (grouping by journal, volume, issue, etc) must be automatic and require no editorial intervention.
- The URIs created for each document or collection must be short, human-comprehensible, and persistent.
- The amount of file-editing (as distinct from content-editing) must be limited to the minimum required to ensure consistent identification; but this should not preclude more complex structures when needed.
- Presentation must be entirely automatic, invisibly based on the identification of the component parts of a document. Rendering conflicts (eg bulleted numbered lists) should be readily identifiable by the editor before publication.
- The system must maximise the exposure of journals, projects, and particularly UCC authors.
- It must be capable of generating publication-quality print formatting as well as web formatting, and other outputs (eg Braille, audio) at a future stage.
- It must be possible to serve a long document automatically in shorter sections, for ease of use.

Pilot phase

A pilot phase to demonstrate proof-of-concept was needed
before any funding could be made available. Additional demands
included the hosting of resources such as the published UCC
Research Bibliography (taken from the campus
InfoEd system), and the provision
of more specialist document services for some research
projects.

To ensure persistence and supportability, only
non-proprietary solutions were considered. Cocoon was chosen
over other document-server solutions (AxKit, PropelX) in order
to minimise development time, maximise stability and
persistence, and avoid the need for third-party binary-only
APIs, unmaintainable and undocumented Perl hacks, etc.

The initial method was to support the DocBook and TEI-Lite
markup vocabularies for the pilot articles and documentation.
However, as there is no suitable editing software available
yet for non-XML-experts, provision was also made for documents
authored in a wordprocessor with Named Styles and saved as
undistinguished XML (ODF, WordML, or Office Open XML).

The pilot phase has been successful and the next phase of
implementation is being started. The lessons learned from the
pilot were invaluable in directing the implementation as well
as the editorial training requirements and the recommendations
to authors for the use of Named Styles in wordprocessors. The
use and re-use of XSLT throughout enabled relatively complex
navigation and formatting requirements to be tested.

Implementation

During the pilot phase, the Cocoon framework has proved very capable of handling both static and dynamic
documents, and of providing an easily maintainable structure for
presentation. This platform therefore became the target for the
full service, but the most essential component was that the
documents themselves should continue to be in a persistent
format, able to withstand changes in technology without
themselves needing to be changed, regardless of the format in
which they get served (eg HTML, PDF). Obsolete and legacy file
formats such as .doc,
.xls, and their earlier equivalents do not
fulfil these requirements and are therefore not directly usable,
but can nowadays easily be saved as XML.

The current stylesheet was based on the site-wide campus default, but this is scheduled to be changed because the original was not designed with normal text documents in mind. The core behaviour of a document's contents (when served as HTML)—headings, paragraphs, lists, tables, figures, quotations, etc—will automatically adapt itself to the geometry of the window in a browser and the specifications of whatever CSS styles are in effect, so creating a new web page layout for the default style or for a specific journal or project only requires some higher-level changes in the XSLT stylesheet and some additional CSS.
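To illustrate the scale of such a change, a journal-specific layout can be little more than a stylesheet which imports the site-wide default and overrides a single template. The names used below (default-site.xsl, the page-wrapper named template, journal-x.css) are invented for this sketch and are not the actual UCC files.

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- journal-x.xsl: hypothetical journal-specific layout -->
  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- reuse all of the site-wide default behaviour -->
    <xsl:import href="default-site.xsl"/>

    <!-- override only the page wrapper, assumed here to be a named
         template in the default stylesheet, to add a banner and an
         extra CSS file; everything else falls through to the import -->
    <xsl:template name="page-wrapper">
      <xsl:param name="content"/>
      <html>
        <head>
          <link rel="stylesheet" type="text/css" href="/css/site.css"/>
          <link rel="stylesheet" type="text/css" href="/css/journal-x.css"/>
        </head>
        <body>
          <div class="journal-banner">Journal X</div>
          <xsl:copy-of select="$content"/>
        </body>
      </html>
    </xsl:template>

  </xsl:stylesheet>

Because xsl:import gives the importing stylesheet higher precedence, only the overridden template changes; the core document behaviour remains identical across journals.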
Changes to the core behaviour are also straightforward, but are undertaken much more rarely because of the need to ensure consistency, both for appearance and for usability and accessibility. XSLT recognises only the logically-identifiable component parts of a document that have been specified, so the default is to ignore the large quantity of formatting which wordprocessors and authors add to documents for their own benefit, but which is completely irrelevant for publication. A good example is the changes in typeface, font, style, and spacing which authors use to make editing easier for themselves, or which wordprocessors add (often arbitrarily or inconsistently) because they are designed to preserve the author's formatting exactly for printing purposes; neither of these is of any relevance when Named Style markup is used to identify the component parts of the document.
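As a rough sketch of what this means for an ODF document, the following stylesheet keeps only the logical structure signalled by Named Styles and silently discards character-level formatting. The style name Quotation and the output elements are illustrative assumptions, not the production mapping (real ODF files also use derived automatic style names, which are not handled here).

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- illustrative only: map ODF paragraphs to logical elements by Named Style -->
  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">

    <!-- a paragraph in the author's "Quotation" Named Style
         becomes a logical block quotation -->
    <xsl:template match="text:p[@text:style-name='Quotation']">
      <blockquote><xsl:apply-templates/></blockquote>
    </xsl:template>

    <!-- any other paragraph is just a paragraph -->
    <xsl:template match="text:p">
      <p><xsl:apply-templates/></p>
    </xsl:template>

    <!-- character-level formatting spans are unwrapped:
         only their text content is kept -->
    <xsl:template match="text:span">
      <xsl:apply-templates/>
    </xsl:template>

  </xsl:stylesheet>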
Journals and conference papers

A naming scheme for regularly-recurring classes of documents was devised so that an editor could upload a group of documents and have them appear immediately in the right place. The Cocoon sitemap file (which controls the URIs by which documents are accessed) works on the basis of pattern-matching, so the required documents only need to match the expected pattern, for example
journal/issue/author.

The filename pattern seq-author-year-vol-issue.xml provides enough information for the sitemapper to identify the document when asked for the prescribed URI. The seq allows the editor to control the order of the articles in an issue (listed by default when the journal/issue URI is requested). It also makes it very straightforward to get Cocoon to serve alternative views of the journal (by author, for example, or by year) without the need for any additional programming.
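By way of illustration, the match involved looks roughly like the following Cocoon sitemap fragment. The pattern, directory layout, and stylesheet name are invented for this sketch, and the extra step which maps the author-based URI onto the seq-author-year-vol-issue.xml filenames is not shown.

  <!-- illustrative Cocoon sitemap fragment, not the production file -->
  <map:pipeline xmlns:map="http://apache.org/cocoon/sitemap/1.0">

    <!-- matches a URI of the form journal/issue/author -->
    <map:match pattern="*/*/*">
      <!-- {1}=journal, {2}=issue, {3}=author: pick up the stored XML -->
      <map:generate src="data/{1}/{2}/{3}.xml"/>
      <!-- render it with the standard article stylesheet -->
      <map:transform src="stylesheets/article2html.xsl"/>
      <map:serialize type="html"/>
    </map:match>

  </map:pipeline>

A further match on the shorter journal/issue pattern can then aggregate the files of an issue into the default listing mentioned above.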
Identifying the sections and subsections of long documents makes it possible to generate an automated table of contents, with the section titles being links to the text. Cocoon can then serve the documents in separately-addressable chunks without the need for user intervention. This has several benefits:

- it reduces the time taken for the page to be opened;
- it means that even very long documents can be hosted without risking browser instability;
- it makes document fragments separately linkable;
- because it uses the ID/IDREF mechanism of XML, links between document sections are transparently preserved even across what appear as separate HTML pages.
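To make the table-of-contents side of this concrete, a template along the following lines is enough, assuming DocBook-like section and title markup with id attributes; the article-id/section-id chunk URI is an invented convention for the sketch.

  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- Sketch: build a linked table of contents from top-level sections.
         Assumes <section id="..."><title>...</title></section> markup;
         the article-id/section-id chunk URI form is illustrative only. -->
    <xsl:template match="article">
      <ul class="toc">
        <xsl:for-each select="section">
          <li>
            <!-- each entry links to the separately-served chunk of that section -->
            <a href="{/article/@id}/{@id}">
              <xsl:value-of select="title"/>
            </a>
          </li>
        </xsl:for-each>
      </ul>
    </xsl:template>

  </xsl:stylesheet>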
Project documents and theses

Projects tend to generate large quantities of ancillary documentation—outlines, descriptions, specifications, bids for funding, periodic reports, interim results, etc. These can all be given a consistent appearance, and listed in suitable formats (document type, year, etc) using a method similar to that outlined for journals.

Current policy is to provide postgraduate students with
web space for a home page for their research, so that they can
publicly stake their claim to their field and publish whatever
interim material they wish (if any), as well as making their
final thesis available for download at a persistent URI.
Moving this service to the electronic publishing framework
means the usual assortment of mailing lists, blogs, wikis,
surveys, etc can also be provided, and their publishable
material can be handled with the same technology as other
documents.

Bibliographies

The UCC InfoEd system provides
a managed environment for maintaining research metadata
(funding, participants, duration, CVs, publications, etc), but
the bibliographic data facilities are very poorly designed and
not easily accessible. A periodic extract of the data has been
tested, and can be both aggregated to departmental level and
categorised by type of publication per author, so that an
efficient online bibliography is available for every
researcher on campus. Individuals continue to maintain their
data in InfoEd, but an automated
weekly extract can be served via Cocoon in the standard UCC
web style. The current test has identified a number of serious
flaws in the way in which InfoEd
stores the data, and the finished implementation must wait
until the current version of InfoEd
is updated.

In generating the references, the internal structure of the bibliographic data is used to provide for a download of individual citations in a form suitable for direct inclusion into a user's personal bibliographic database (eg Reference Manager, EndNote, ProCite, JabRef, etc). This makes it very easy for anyone to cite works by a UCC author in the correct format in their own writings. The same technique is used in the journal and conference formats, but for the published papers only, not for the references within them, unless the paper is marked up using a format which can be used analytically, such as TEI or DocBook.
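As an indication of how little machinery this needs, a citation download can be produced by a small text-output XSLT stylesheet along these lines. The input record format is an invented simplification rather than the real InfoEd extract, and RIS is only one of the possible outputs (JabRef, for example, would expect BibTeX instead).

  <?xml version="1.0" encoding="UTF-8"?>
  <!-- Sketch: turn one bibliographic record into a downloadable RIS citation
       (the format read by Reference Manager, EndNote and ProCite).
       The input form <record><author/><title/><journal/><year/></record>
       is an invented simplification, not the real InfoEd extract. -->
  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>

    <xsl:template match="record">
      <xsl:text>TY  - JOUR&#10;</xsl:text>
      <xsl:for-each select="author">
        <xsl:text>AU  - </xsl:text>
        <xsl:value-of select="."/>
        <xsl:text>&#10;</xsl:text>
      </xsl:for-each>
      <xsl:text>TI  - </xsl:text>
      <xsl:value-of select="title"/>
      <xsl:text>&#10;JO  - </xsl:text>
      <xsl:value-of select="journal"/>
      <xsl:text>&#10;PY  - </xsl:text>
      <xsl:value-of select="year"/>
      <xsl:text>&#10;ER  - &#10;</xsl:text>
    </xsl:template>
  </xsl:stylesheet>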
Documentation and isolated documents

There is occasional demand both from the academic and the administrative communities for a place to put an individual document which is outside the scope of the design of the main campus web site, or which has some special applicability.

The Unit maintains a growing quantity of documentation
about its work and about the web. Because of the
closely-related nature of these documents, they are more
easily maintained in XML than anything else, so this server is
a natural place to keep and serve them.One of the major benefits (referred to ) is the ability to serve
documents in chunks which can be linked to separately, even
though the document is a much larger entity, without the need
to download the whole document or have to scroll through it to
find the reference.

Conclusions

The success of the pilot phase has already drawn attention
from half a dozen departments and projects in the College who
either have documents they want to make available online or who
are planning to start an online journal.

None of this would have been viable if the major
wordprocessors had not chosen to provide XML as a save format.
Despite the horrendous complexity of what has been called memory dumps [of the internal data structures] with angle brackets around them (ODF and OOXML), it is not impossible to extract the relevant
identifiable text and present it on the web or format it for
print. The non-interventionist policy of using Named Styles
means authors and editors can simply upload the documents they
want published and see them appear in the right place. (There is in fact a preview period before indexing, in order to give uploaders a chance to withdraw and edit a document if a problem is discovered at the last moment.)

The project has also been the trigger for some of the
developments in web formatting, based on the requirements of
some of the College's existing projects now moving their service
data into an interactive XML environment:

- the implementation of the CSS LightBox technique (due to Lokesh Dhakar) to display epexegetic information like notes, references, and variant readings transiently without the need to open a new window;
- the provision of selectable downloads of bibliographic data in reusable format;
- the algorithm for positioning footnotes in a medium which is pageless (browser HTML);
- the use of XSLT to provide publication-formatted PDF copies of documents in real time (as an alternative to the time-consuming XSL:FO methods).

References

Lokesh Dhakar. Lightbox JS v2.0. http://www.huddletogether.com/projects/lightbox2/

Håkon Wium Lie. Microsoft's amusing standards stance. c|net News Perspectives. http://news.com.com/Microsofts+standards+choice/2010-1013_3-6161285.html

Tim O'Reilly. What Is Web 2.0: Design Patterns and Business Models for the Next Generation of Software. O'Reilly Net. http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html

Ted Nelson. Literary Machines, edition 87.1. Sausalito Press, Swarthmore, PA.

Tim Berners-Lee, James Hendler, and Ora Lassila. The Semantic Web: A new form of Web content that is meaningful to computers will unleash a revolution of new possibilities. Scientific American. http://www.sciam.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21

Marc Prensky. Digital Natives, Digital Immigrants. On the Horizon 9(5). http://www.marcprensky.com/writing/Prensky%20-%20Digital%20Natives,%20Digital%20Immigrants%20-%20Part1.pdf

JTC 1/SC 34. Information technology—Open Document Format for Office Applications (OpenDocument) v1.0. ISO/IEC 26300:2006. International Organization for Standardization, Geneva.