This article introduces the project A Big Data History of Music, which set out to unlock the bibliographical data held by research libraries in order to create new research opportunities for musicologists. The project cleaned and enhanced aspects of the British Library catalogues of printed and manuscript music, which are now available as open data. It also experimented with the analysis and visualization of the British Library datasets and the RISM inventories of printed and manuscript music. The article shows how quantitative analysis of these datasets can expose long-term historical trends, such as the rise and fall of music printing in 16th- and 17th-century Europe. Data analysis and visualization also facilitate research on the dissemination and canonization of specific composers (as shown by case-studies on Palestrina and Purcell) and on changing trends in genres, scoring and ethnic colourings in music (as shown by a case-study on ‘Scottish’ music).
Big Data has been defined as information that requires special processing techniques because it exists in large quantities, is highly heterogeneous, or is produced extremely quickly.1 Big Data is usually associated with major scientific endeavours such as the Large Hadron Collider or the Human Genome Project. These projects produce millions of gigabytes of data annually, to be analysed collaboratively by scientists spread over many nations. Yet humanities scholars have also been mining large datasets, such as the full-text archives produced by optical character recognition (OCR) of digitized books and other scanned documents. In 2014 the historians Jo Guldi and David Armitage called on their discipline to use quantitative analysis to understand long-term historical change—for instance, to plot the effects of climate change or the varying distribution of wealth. For Guldi and Armitage, such large-scale data analysis can help show the synchronicity and interdependence of global events, countering a focus on small case studies and microhistory.2
Literary historians have also explored the new perspectives offered by large quantities of data. Franco Moretti pioneered the technique of ‘distant reading’, in which he analysed bibliographical data such as the titles of novels and their publication details. Whereas the ‘close reading’ typically practised by literary critics focuses on a few canonized texts, Moretti sought through his quantitative analyses to gain an overview of the production of novels across the 18th and 19th centuries. He showed how political or military conflict led to an initial collapse and then a belated rise in novel writing, for instance in France after 1789 or in Milan during the wars of the late 1840s. In a study of the English novel, he showed how each genre (for example, the sentimental or the Gothic novel) was in favour for about 25 to 50 years, before being superseded by another genre.3 Moretti’s approach (not least his claim that ‘quantitative research provides a type of data that is ideally independent of interpretations’)4 is controversial and has attracted charges of positivism. In response, he and his defenders argue that ‘distant reading’ draws attention to the ‘Great Unread’, allowing the representativeness of the literary canon to be evaluated against the thousands of other novels produced in the period.5
For music historians there already exist large bibliographical datasets, including the catalogues of research libraries and the various inventories created by RISM (Répertoire International des Sources Musicales). The British Library’s catalogue of printed music, for instance, describes works by more than 100,000 composers. Such large datasets offer the possibility for a ‘distant reading’ of musical sources, directing scholarly attention away from the canonized composers that are the usual object of research and instead highlighting long-term trends. By extending the scope of musicological study in this way, we open the possibility of exploring what might be called (to adapt Moretti’s term) the ‘Great Unheard’. A further advantage of investigating bibliographical datasets is that they already possess structure, having been created according to the rules of library cataloguing; they therefore may be easier to manipulate and analyse than other large datasets available to musicologists (such as libraries of audio files).
Our project A Big Data History of Music, a collaboration between Royal Holloway and the British Library, has explored how such large bibliographical datasets may open new avenues for research into music history. The first phase of the project cleaned and enhanced various aspects of the British Library’s catalogues of printed and manuscript music. The second phase piloted techniques for analysing large datasets, in order to examine large-scale trends in music history and to use visualizations to test and develop hypotheses. This article introduces the datasets used in the project, and describes some of the results gained in the second phase of the project. It is hoped this account will whet the appetite of readers to explore the datasets and undertake similar analyses themselves.
Our project worked with several datasets, whose characteristics and limitations will be briefly described here. The British Library’s catalogue of printed music (search interface at http://explore.bl.uk) contains over a million records, describing publications between 1500 and the present day. The British Library has a copy of most music published in Britain, acquired as a result of legal deposit legislation, and a vast collection of material from elsewhere; however, much popular and ‘light’ music of the 20th century still has not been added to the catalogue. Catalogue entries vary markedly in their level of detail, having been accumulated over two centuries by cataloguers working to different standards. Some old records give little more than the title of the book and the place of publication, whereas the records for 16th-century anthologies have been recently upgraded to include transcriptions of title-pages and full inventories of contents. Information such as place of publication and name of publisher is recorded in the form given on the copy, so can vary enormously; thus the location ‘Lyon’ may be recorded in such variants as ‘Leon’, ‘Lions’, ‘Lugduni’ or ‘Lubduni’. The dating of much 18th- and 19th-century printed music is conjectural—often cataloguers assigned these publications to a round date such as a new decade—and therefore cannot be relied on for a year-by-year chronological analysis.
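Variant spellings of the kind listed above for Lyon can be reconciled with a simple lookup table before any counting is attempted. The following Python fragment is a minimal sketch of that idea, not the project's actual tooling: the `PLACE_VARIANTS` table and the `normalize_place` helper are hypothetical names, and only the Lyon variants are taken from the catalogue examples quoted here.

```python
# Hypothetical lookup table: variant spellings as they appear on the copy,
# mapped to one standard form. Only the Lyon variants come from the text above.
PLACE_VARIANTS = {
    "leon": "Lyon",
    "lions": "Lyon",
    "lugduni": "Lyon",
    "lubduni": "Lyon",
    "lyon": "Lyon",
}

def normalize_place(raw):
    """Return the standard place name, or the cleaned input if unknown."""
    # Drop surrounding whitespace, punctuation and cataloguers' brackets.
    cleaned = raw.strip().strip(".,:;[]").strip()
    return PLACE_VARIANTS.get(cleaned.lower(), cleaned)

records = ["Lugduni", " Lions ", "[Leon]", "Venetia"]
print([normalize_place(r) for r in records])
```

Unknown forms (here ‘Venetia’) are passed through unchanged, so that they can be reviewed by hand and added to the table rather than silently guessed.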
Regarding the British Library’s catalogues of manuscript music, the project primarily worked with a digitized version of Augustus Hughes-Hughes’s Catalogue of manuscript music in the British Museum (London, 1906–9), which until now has not been available electronically. The dataset derived from Hughes-Hughes’s catalogue contains more than 35,000 records, each describing an individual composition in a manuscript, with details of genre and composer where known. Unlike the catalogue of printed music, information on the place of origin is rarely given. Both British Library datasets are freely available for download from www.bl.uk/bibliographic/download.html as CSV (comma-separated value) files, for users who wish to work in software such as Excel. The catalogue of printed music is also available in RDF/XML; RDF (the Resource Description Framework) enables the exchange and reuse of data on the web, giving users the opportunity to combine this dataset with other resources.
Also used in the project were RISM datasets, particularly its inventories of early printed music before 1800: RISM A/I contains about 100,000 records describing editions holding the work of a single composer; RISM B/I contains about 17,000 records for anthologies (containing works by more than one composer). As the product of an international cataloguing effort, RISM A/I and B/I have a much wider geographical scope than the British Library catalogues. Although their coverage of Eastern European and Iberian libraries is patchy, RISM A/I and B/I probably list up to 80 per cent of extant printed editions worldwide. Like the British Library’s catalogue of printed music, RISM A/I and B/I contain information on places of publication; dates of publication are given only when included on the copy, meaning much music printed after 1700 has no date allocated to it. The final dataset used was RISM A/II, which contains over 900,000 records describing manuscripts originating between c.1500 and c.1850, often catalogued to a high level of detail, with information on constituent compositions. Its geographical coverage is strongest for German- and English-speaking lands; it has relatively few contributions from French, Italian or Iberian libraries, which have preferred to catalogue their manuscript holdings in national bibliographical initiatives. Since 2012, RISM A/II has been available as open data from http://opac.rism.info, and from May 2015 all of RISM A/I and a small portion of B/I can also be consulted via this site.6
Most of the project datasets were obtained in the format of library catalogue records (MARC 21), from which spreadsheets of data were exported using the tool MarcEdit (http://marcedit.reeset.net). In the initial phase of data cleaning, particular attention was given to facets such as the places and dates of publication, as these are often recorded in variant forms that can thwart automated analysis. Data cleaning and alignment were again an important part of the second phase of the project, because an excerpt of data rarely has sufficient consistency to be immediately suitable for analysis. Once a dataset has been prepared for analysis, it can be manipulated and visualized with a variety of tools, ranging from Excel spreadsheets to open-source software such as the R Project for Statistical Computing (www.r-project.org/).7 The following sections describe case studies explored in the project, showing how the analysis of large datasets allows new ways of studying music history.
The rise and fall of music publishing, 1500–1700
Quantitative analysis can allow musicologists to detect long-term trends, for instance involving the formation of musical markets and musical taste across centuries. Once a long-range development has been detected, it is possible to identify the individual items that contribute to this trend; such dynamic switching of focus between macro- and micro-scale is one of the most powerful aspects of Big Data analysis, although hard to capture in a journal article such as this. As an example, we analysed the rise and fall of music publishing in the 16th and 17th centuries, using data from RISM A/I and B/I. As mentioned above, the RISM datasets are reasonably comprehensive and have a high degree of chronological accuracy: typeset printed music of the 16th and 17th centuries is usually dated to a specific year on its title-page, unlike the engraved or lithographed music of later eras. Spellings of place names in the dataset required standardization, and geographic co-ordinates were added to facilitate the production of maps. The following analysis was then done in Excel using a spreadsheet of over 16,000 bibliographical entries.
Viewed decade-by-decade (illus.1), the RISM data shows the rise of European music publishing across the 16th century, albeit with a plateau in the 1570s and a brief dip in the 1590s. Music publishing reached a peak in the 1610s, during which decade approximately 1,800 editions of music were printed. Such an increase in printing constituted a paradigm shift in how composers disseminated their works in the 16th and early 17th centuries. Yet in the 1630s music printing suddenly declined, and for the rest of the 17th century the industry operated at about half its previous level of intensity, with never more than about 900 publications surviving from each decade.
The red and blue shadings in illus.1 show anthologies versus single-composer editions respectively. Anthologies dominate the early years of music printing, suggesting the entrepreneurial role of publishers and editors in this emerging industry. In the 1540s anthologies still accounted for about half of all printed music, but thereafter their number remained static at approximately 200 per decade. The subsequent growth in printed music entirely comprised single-composer collections, suggesting that from the 1550s composers took more initiative in publishing their works for financial gain or as symbols of prestige and skill. This quantitative analysis supports Kate van Orden’s recent suggestion that in the early 16th century, ‘it is hard to presume … that print was a natural locus of [musicians’] authorial identity’, yet by the 1550s there was ‘a dramatic shift in the attitude of composers toward the [single-composer] book of music’.8
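The decade-by-decade tallies behind a chart such as illus.1 were produced in Excel, but the same counting can be sketched in a few lines of Python. The rows below are invented sample values rather than RISM figures, and the `decade_of` helper is a hypothetical name introduced for illustration.

```python
from collections import Counter

# Illustrative rows only: (year of publication, kind of edition).
# These are invented sample values, not figures taken from RISM.
editions = [
    (1538, "anthology"),
    (1542, "anthology"),
    (1547, "single-composer"),
    (1561, "single-composer"),
    (1565, "single-composer"),
    (1569, "anthology"),
]

def decade_of(year):
    """Map a year to the first year of its decade, e.g. 1547 -> 1540."""
    return year - year % 10

# Tally editions by (decade, kind); a Counter returns 0 for missing keys.
counts = Counter((decade_of(year), kind) for year, kind in editions)

for decade in sorted({d for d, _ in counts}):
    anthologies = counts[(decade, "anthology")]
    singles = counts[(decade, "single-composer")]
    share = anthologies / (anthologies + singles)
    print(f"{decade}s: {anthologies} anthologies, {singles} single-composer "
          f"({share:.0%} anthologies)")
```

Keeping the kind of edition in the key makes it possible to read off both the total output per decade and the proportion of anthologies from the same tally.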
Having observed these large-scale trends, we can examine the data in closer detail. A year-by-year analysis (illus.2) shows that the plateau in European music publishing in the 1570s can be attributed to falls in 1571/2 and 1576/7. A chart of the output of the leading printing centres (illus.3) shows that substantial falls occurred in Venice during these years. Both dips can be attributed to external factors: in 1571 the war with the Turks (culminating in the Battle of Lepanto), and in 1576/7 the plague epidemic that killed about 30 per cent of the Venetian population.9 Almost 30 years ago, Tim Carter used RISM data to chart the publishing of secular music in late 16th-century Italy, and his graphs likewise showed the temporary dips caused by these Venetian crises in 1571 and 1576/7.10 Compared to Carter’s article, the advantages of a digital analysis lie in the ease with which the data can be manipulated, drilling down to expose the individual publications produced in Venice, yet also placing Venice within Europe-wide trends for music printing.
Turning to the fate of music printing in the 17th century, a Big Data approach allows long-term trends to be plotted and thereby raises questions about the social and economic factors shaping musical life. As illus.1 shows, letterpress music printing had a distinct lifespan, with a sharp decline in the 1630s. Such a profile conforms to Fernand Braudel’s comment on the life expectancy of industries before the modern era: ‘the typical pattern of a sharp rise followed by an abrupt fall can very easily be imagined as the probable profile, in the pre-industrial economy’.11 In the case of music printing, one reason for the ‘abrupt fall’ was that movable type could not represent the complexities of virtuoso vocal or instrumental music, and it was then partly superseded by manuscript dissemination in the 17th century.
Analysis of the RISM data can also show the geographical reconfiguration of music publishing in the mid-17th century. In the previous century, Venice dominated music printing, typically producing over half of the European output of printed music in each decade. Illus.4, charting the ten most productive centres of music printing in the 1570s, shows the lagoon city’s pre-eminence even in that crisis-ridden decade. Venice’s dominance ceased in the 1630s, partly because of local reasons such as another plague outbreak in 1630–2, but also because of deeper structural changes, as the focus of the European economy shifted away from the Mediterranean to centres with closer access to the Atlantic trade such as London, Paris and Amsterdam.12 These cities began to play a major role in European music printing from the 1650s (see illus.3), although by this stage the industry had fragmented. The number of printing centres increased, yet each typically had a smaller output and served a narrower market. Illus.5 shows the ten cities with the highest output of printed music in the 1690s. No longer was a single city dominant: instead London and Paris had equal importance, with just over 150 items of printed music each. Bologna was the third most productive centre of music printing, and Venice and Amsterdam were in fourth and fifth places respectively. Such analyses highlight trends spanning two centuries, showing the reconfiguration of music printing in response to the economic and musical changes of the 17th century.
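Rankings of printing centres such as those in illus.4 and illus.5 amount to grouping entries by decade and then counting places within each group. A minimal Python sketch follows, assuming place names have already been standardized; the rows are invented for illustration and `top_centres` is a hypothetical helper.

```python
from collections import Counter, defaultdict

# Invented sample rows: (year, standardized place of publication).
rows = [
    (1571, "Venice"), (1573, "Venice"), (1575, "Venice"),
    (1574, "Antwerp"), (1578, "Nuremberg"),
    (1691, "London"), (1695, "London"),
    (1692, "Paris"), (1698, "Paris"),
    (1694, "Bologna"),
]

# One Counter of places per decade.
by_decade = defaultdict(Counter)
for year, place in rows:
    by_decade[year - year % 10][place] += 1

def top_centres(decade, n=10):
    """Return the n most productive places in a decade with their counts."""
    return by_decade.get(decade, Counter()).most_common(n)

print(top_centres(1570))
print(top_centres(1690))
```

Because the individual rows remain available, any bar in the resulting ranking can be traced back to the publications that make it up.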
The previous paragraphs have used broad brush-strokes to represent complex phenomena. It might be objected that counting publications is a crude measure: surely a book historian should distinguish between large volumes containing many compositions and single-sheet songs, between first editions and reprints, and between pricey folio editions and cheap octavo books? Clearly the analysis could be nuanced in many ways. Yet the advantage of a digital analysis is that it is easy to cross-refer at all times to the master-sheet of individual bibliographical entries, and if necessary to augment this data or change the selection for analysis. Such Big Data analyses add a wider perspective to musicological study, showing how the individual sources (which are the usual object of research) contribute to broader trends, and thereby highlighting the interplay between music and its economic, political and social environments.
Mapping dissemination, reception and canonization
Analysis of bibliographical data can also illuminate the dissemination and reception of the works of specific composers, as the following case studies on Giovanni Pierluigi da Palestrina and Henry Purcell show. Spreadsheets detailing the dissemination of their works, derived from relevant entries in RISM A/I, A/II, B/I or B/II, can be imported into web-based visualization services such as Google Fusion Tables (http://google.com/fusiontables) or Palladio (http://palladio.designhumanities.org/). These can produce a map of geo-coded data, or create network diagrams that show the links between entities (for instance, between musical works and places). Such network diagrams can expose geographical or chronological trends in the circulation of music, raising questions about why certain works or genres gained importance while others remained outliers.
Illus.6 is a network diagram showing the places and decades where Palestrina’s music was published in the 16th and 17th centuries. It includes single-composer editions of Palestrina (as listed in RISM A/I) and anthologies containing works by Palestrina (as listed in RISM B/I). The diagram clarifies which locations were central or peripheral to the dissemination of his music, and shows some of the patterns in the posthumous publication of his music. The size of the nodes shows that the most important locations for the publication of Palestrina’s music (in terms of numbers of books) were Venice and Rome. Venice, as demonstrated above, had an overwhelmingly dominant position in European music printing of the 16th century; in the 1570s Palestrina favoured Scotto and Gardano as publishers, and Venetian firms offered reprints of many of Palestrina’s works initially printed in Rome. The second biggest node, Rome, was the place where the first editions of many of Palestrina’s sacred collections appeared, and as the centre of Tridentine church reform it remained important for the ongoing publication of his liturgical music.
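A network diagram of this kind rests on a simple table of links between places and decades. The Python sketch below shows one way such a table might be prepared before being passed to a visualization service; it is an assumption about the general approach rather than a record of how illus.6 was made, and the rows are invented sample values.

```python
from collections import Counter

# Invented sample rows: (place, year) for editions containing a composer's music.
editions = [
    ("Venice", 1572), ("Venice", 1575), ("Venice", 1586),
    ("Rome", 1575), ("Rome", 1601), ("Antwerp", 1609),
]

# Each edge links a place node to a decade node; the weight is the number
# of editions, which a visualization tool could use to scale the link.
edges = Counter((place, f"{year - year % 10}s") for place, year in editions)

# Node size for each place: total number of editions published there.
node_size = Counter(place for place, _ in editions)

# Print the edge list as comma-separated rows (place, decade, weight).
for (place, decade), weight in sorted(edges.items()):
    print(f"{place},{decade},{weight}")
```

The node sizes correspond to the numbers of books by which the diagram ranks locations, while the edges record when each location was active.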
So far, this analysis has confirmed the observations in Jane Bernstein’s 2007 article on Palestrina’s publishing strategy.13 Where the network diagram makes a distinct contribution is in clarifying the chronological and geographical extremes of the publishing of Palestrina’s music. It shows that after Palestrina’s death in 1594, his music continued to be published mainly in Catholic centres such as Rome (where single-composer editions of his hymns and offertories appeared until the 1620s, and anthologies with his music, notably Anerio’s four-voice arrangement of the Missa Papae Marcelli, appeared until the 1680s). Another centre for the posthumous printing of Palestrina’s music was Antwerp, which was forcibly re-catholicized after 1585.
Illus.6 furthermore shows the smaller publishing centres and peripheral locations where Palestrina’s music appeared. Given his reputation as an archetypal Catholic composer, it is not surprising that his music was printed in Counter-Reformation Milan and the Jesuit university town of Dillingen in Bavaria. The diagram also demonstrates the dates when Palestrina’s music was printed in Protestant locations—Nuremberg, Strasbourg and London in the 1580s, Heidelberg in the 1600s and Leipzig in the 1610s. Glimpsing such outliers can prompt an investigation into which genres and compositions travelled to Protestant locations: for instance, did madrigals in contrafacta or in wordless intabulations for lute travel better than Latin motets?
Unusually for composers of the 16th century, Palestrina’s music underwent a strong revival in the 18th and 19th centuries. The strength of this revival is indicated by the RISM A/II dataset of music manuscripts. Here the caveat must be repeated that RISM A/II is incomplete, with little coverage of French, Iberian or Italian holdings. These omissions notwithstanding, Table 1 shows the enormous increase in manuscript copies of Palestrina in the 18th and 19th centuries, an increase probably partly triggered by Fux’s veneration of Palestrina in his Gradus ad Parnassum (1725).14 Such statistics demand closer scrutiny, for instance an investigation of which of Palestrina’s compositions were copied most, or a study to see if other 16th-century composers such as Arcadelt or Lassus underwent a comparable revival. At the least, though, such quantitative analysis can open new avenues for research into reception history.
|16th century||17th century||18th century||19th century|
A similar set of research questions—investigating posthumous dissemination and canonization—can be addressed via an analysis of bibliographical data for the works of Purcell. William Weber has singled out Purcell and Corelli as the two 17th-century composers who were canonized in Britain in the 18th century.15 Yet how Purcell’s music became part of the musical canon has not been fully investigated. John Higney, in a 2008 thesis on the 18th-century dissemination of Purcell’s works, focused almost exclusively on secular vocal compositions. Using a sample of fewer than 200 publications, Higney observed a sharp decline in the publication of Purcell’s music in the second decade of the 18th century, with a slight upturn in the composer’s fortunes from 1770.16 Analysing a larger dataset—RISM series A/I, B/I and B/II, which together list 503 publications containing music by or attributed to Purcell, issued between 1680 and 1799—shows different results (illus.7). The rate of publication drops earlier and more sharply, with the number of publications falling from about 100 in the 1690s to fewer than 20 in the 1750s. An upturn during the last four decades of the 18th century reflects the growing interest in older repertory, fostered by publications such as William Boyce’s Cathedral Music (1760–73) and the formation of groups with an interest in earlier music, such as the Concerts of Ancient Music.17
One complication with illus.7 is the lack of secure dates of publication for many 18th-century editions, which frequently can be dated only to the nearest decade. It is possible that some of the publications dated c.1700 (which in this graph have been assigned to the decade 1700–10) actually date from the late 1690s, which would make the decline in the 1700s even more pronounced. The analysis can be nuanced further by identifying the genre (sacred or secular), the publishing format (single-composer edition or anthology) and publishing location (London or provincial towns); this can highlight patterns such as the tendency for Purcell’s sacred music to appear in anthologies. Having identified overall trends via a quantitative analysis, musicologists will want to probe the data to identify the constituent publications that contribute to these trends.
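The effect of conjectural dating can be tested by flagging approximate dates before they are assigned to a decade. The following Python sketch assumes that conjectural dates are marked with forms such as ‘c.1700’ or ‘[1700?]’; the actual conventions vary between catalogues, and `parse_catalogue_date` is a hypothetical helper.

```python
import re

def parse_catalogue_date(text):
    """Return (year, is_conjectural) for strings like '1695', 'c.1700' or '[1700?]'.

    Returns (None, False) when no four-digit year beginning with 1 is found.
    """
    match = re.search(r"\b(1[0-9]{3})\b", text)
    if match is None:
        return None, False
    # Assumed markers of conjecture: c./ca./circa before the year,
    # a question mark, or an opening square bracket.
    conjectural = bool(re.search(r"\b(c|ca|circa)\.?\s*\d|\?|\[", text))
    return int(match.group(1)), conjectural

dates = ["1695", "c.1700", "[1700?]", "1706", "1712"]
parsed = [parse_catalogue_date(d) for d in dates]
firm = [year for year, conjectural in parsed if year and not conjectural]
uncertain = [year for year, conjectural in parsed if year and conjectural]
print(len(firm), "firmly dated;", len(uncertain), "conjectural")
```

Counting the two groups separately shows how much of a given decade's total depends on approximate dates, and so how far a rise or fall in the graph should be trusted.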
Data from the RISM A/II catalogue of music manuscripts sheds light on the posthumous performance history of specific works by Purcell. RISM A/II contains over 2,500 entries for compositions of Purcell’s in manuscript, over half of which represent sacred music (predominantly anthems). This is a testimony to the work of RISM UK in cataloguing cathedral and chapel libraries, although there are gaps (notably for various Oxford and Cambridge institutions, and also for Westminster Abbey). In 2010 Rebecca Herissone commented that ‘the part-books and other records surviving for many religious institutions from the 18th century onwards … have the potential to establish whether, for example, performance of [Purcell’s] sacred repertory became restricted to a small number of works ... or whether it was influenced by printed editions such as that of Boyce’.18
Illus.8 shows how data analysis can answer Herissone’s questions. This network diagram shows Purcell anthems held in cathedral and chapel libraries, generally in the form of performing parts from the 18th century. The diagram shows a core repertory performed in many places, including pieces such as Be merciful unto me O God, I was glad and My song shall be alway that survive in 14 or more religious institutions. By contrast, a large number of anthems appear to have been performed at a relatively small number of cathedrals, including Out of the deep and My beloved spake. The RISM data suggests that many of Purcell’s sacred works published in Boyce’s Cathedral Music (such as the Service in B♭ or the anthem Be merciful unto me) had already circulated widely in manuscript before this edition appeared. A network diagram such as illus.8 can prompt further enquiries, for instance to identify features of the music or text that made anthems such as I was glad so popular, whereas pieces such as Out of the deep remained peripheral. Similar studies could be done for other composers whose works are well represented in RISM A/II (such as J. S. Bach), allowing the existing bibliographical data to be harnessed to answer questions about reception and canonization.
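The distinction between a core repertory and peripheral pieces can be expressed as a count of the distinct institutions holding each anthem. The Python sketch below illustrates the principle with invented holdings; the institution names are placeholders and `core_repertory` is a hypothetical helper, not part of any RISM tool.

```python
from collections import defaultdict

# Invented sample rows: (anthem title, holding institution).
holdings = [
    ("I was glad", "Cathedral A"), ("I was glad", "Cathedral B"),
    ("I was glad", "Cathedral C"), ("I was glad", "Cathedral A"),
    ("Out of the deep", "Cathedral B"),
]

# Collect the distinct institutions for each title.
institutions = defaultdict(set)
for title, library in holdings:
    institutions[title].add(library)  # a set ignores duplicate part-books

def core_repertory(threshold):
    """Titles held by at least `threshold` distinct institutions."""
    return sorted(t for t, libs in institutions.items() if len(libs) >= threshold)

print(core_repertory(3))
```

Counting institutions rather than records prevents a single library with many part-books of one anthem from inflating that anthem's apparent circulation.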
New directions for research
A Big Data approach also allows music history to be studied in new ways. Currently much of the infrastructure for researching music history (for example, Grove Music Online) is built primarily to be searched by the name of the composer. Yet datasets such as RISM A/I or the British Library’s catalogue of printed music also include partial transcriptions of title-pages, and such data can be searched for keywords that may indicate scoring (for example, ‘piano’) or ethnic colourings (for example, ‘Hungarian’) or may refer to historical events and figures (for example, ‘Waterloo’, ‘Victoria’).
As an example of these possible new avenues, illus.9 offers an investigation of the rise of ‘Scottish’ music. The chart shows publications between 1680 and 1899 in the British Library with a version of the keyword ‘Scottish’ (for example, Scots, Schottisch) in the title. Much of this music is not by Scottish composers, but projects a stereotyped English view of Scottish identity. Once again, the dating of printed music from this period is often uncertain; many of these ‘Scots’ publications have been assigned to an approximate decade by cataloguers. (Given these uncertainties, it would be interesting for a future study to analyse the holdings of other libraries such as the Library of Congress or the National Library of Scotland, although currently their metadata is not openly available.) Illus.9 shows that ‘Scots’ songs were present as a low-level publishing phenomenon in England from the 1680s onwards, initially consisting mainly of single songs performed on the London stage. Numbers of ‘Scots’ songs rose from the 1770s onwards, with a peak in the 1820s. Some reasons for this increase can be found by examining the constituent data, which from the 1780s includes settings of James Macpherson’s ‘Ossian’ publications; these ballads, supposedly recovered from a Pictish bard, sparked a craze for the noble primitivism of the ancient savage. Also prominent in the 1800s were settings of the poetry of Walter Scott, whose romanticized evocations of Scottish history achieved immense popularity.
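A keyword search of this kind can be approximated with a pattern that matches the variant spellings and a tally by decade. The Python sketch below is illustrative only: the list of variants in `SCOTTISH` is an assumption extending the examples given above, and the titles and decades are invented rather than drawn from the British Library catalogue.

```python
import re
from collections import Counter

# Assumed variant spellings; the text mentions 'Scots' and 'Schottisch'.
SCOTTISH = re.compile(r"\b(scots?|scotch|scottish|schottisch\w*)\b", re.IGNORECASE)

# Invented sample rows: (title, decade assigned by the cataloguer).
titles = [
    ("A new Scotch song", 1690),
    ("Six Scots songs", 1780),
    ("Schottische Lieder", 1820),
    ("Sonata in G", 1820),
]

# Count only the titles containing one of the keyword variants.
per_decade = Counter(decade for title, decade in titles if SCOTTISH.search(title))

for decade in sorted(per_decade):
    print(f"{decade}s: {per_decade[decade]}")
```

The word boundaries in the pattern keep it from matching surnames such as ‘Scott’, although any real search would need its list of variants checked against the catalogue itself.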
Illus.9 shows that the output of ‘Scots’ music more than halved in the 1830s, even though British music publishing as a whole was buoyant in that decade. Perhaps the craze for Ossianic poetry was starting to subside after it was exposed as a forgery, or perhaps as Scotland’s place in the political union became more secure, there was less need to fabricate ‘Scots’ music.19 Equally, this dip may be a quirk caused by the uncertain dating of much of the material. However, the quantity of ‘Scots’ music recovered from the 1840s onwards. In 1842 Queen Victoria made her first visit to Scotland and the scenery received royal approval from Prince Albert as ‘very German-looking’;20 in 1852 Balmoral became a royal residence. An immediate musical response can be seen in such publications as Henry Bishop’s setting of Scotia’s welcome to Victoria or a collection such as The Balmoral Quadrilles as Performed at the Palace.21 The historian Hugh Trevor-Roper famously showed that many aspects of Highland Scottish culture—the kilt, tartan and bagpipes—were in part inventions of the late 18th and early 19th centuries.22 Our analysis of bibliographical data can chart the musical counterpart of this invention of tradition. Similar analyses could probe the rise of other national flavours in music of the 19th century, for instance music that described itself as Czech, Hungarian or Polish.
Quantitative analysis of bibliographical data offers challenges and opportunities to historical musicologists. For a discipline increasingly focused on small-scale case studies, a data-driven approach offers an opportunity to glimpse how individual manuscripts or printed editions might relate to wider trends. As shown in the case studies on the rise and fall of music publishing and on the rise of ‘Scottish’ music, such long-term trends may illuminate the interplay between music and broader social, economic or political changes.
Contrary to Moretti’s assertion that the data should be ‘ideally independent of interpretations’, any quantitative analysis of bibliographical data is inevitably shaped by the assumptions and scholarly horizons of the cataloguers who contributed to such datasets as the RISM bibliographies and the British Library catalogues. These bibliographical resources have been assembled over many decades, often to changing standards of cataloguing, with information about composer attributions and conjectural datings reflecting the fluctuating state of scholarly knowledge. Furthermore, most of these datasets are subject to ongoing revision and updating, particularly RISM A/II which has up to 30,000 new records added to it each year. Analysts need to be aware of the dynamic nature of the datasets and ensure they use an up-to-date version if possible.
For all these reasons, Moretti’s ideal of an objective quantitative analysis cannot be attained. Instead, any informed user of musical-bibliographical Big Data must confront the histories and idiosyncrasies of datasets such as RISM. Data analysis cannot supplant older methods of musicological study but should be used symbiotically with them: manipulating and visualizing data can show trends not otherwise evident from individual manuscript or printed sources, but should also send us back to those original sources with renewed interest.