All the World’s a (Hyper)Graph: A Data Drama

We introduce HYPERBARD, a dataset of diverse relational data representations derived from Shakespeare's plays. Our representations range from simple graphs capturing character co-occurrence in single scenes to hypergraphs encoding complex communication settings and character contributions as hyperedges with edge-specific node weights. By making multiple intuitive representations readily available for experimentation, we facilitate rigorous representation robustness checks in graph learning, graph mining, and network analysis.

Cre. What canny creatures met my febrile mind. That friendly faun, the gentle spirit, exchanging such profound considerations. I wish I could have stayed a little longer; instead, I'm left to draw my own conclusions. What graph shadows could I create by shining different lights on what there is? It seems the sensible depends on the semantics.
They close their eyes, following their thoughts.
Cre. When we transform reality to math, Graphs are but outputs, in-phenomena.
The myriad transformations that we see, How do they differ systematically?
For now, we shall distinguish three dimensions.
First, our semantic mapping, nodes and edges: What types of entities do we assign?
Second, our granularity: What are Our modeling units for semantic mapping?
And third, our expressivity: What more Do we attach to all our modeling units?
Tracing coordinate axes with their fingers, they sigh. Cre. All these distinctions, it appears, are known in the Community [18]. And yet, the knowledge seldom heeded: graph data shelves are filled with all these captive singular truths. We hardly hold what that free faun foresaw: For every data point, a set of transformations as its data. I wonder why.
Col. Alas, they really want documentation?
CREATURE steps into the door frame, unnoticed. Col. A datasheet [7]? Well, all the world is data, And all we care for merely data points; They get created, updated, deleted, And every data point plays many parts, Its fate being seven stages. First, motivation Defining purpose or specific tasks. Accounting now for expressivity, These edges may be binary or multi, Or weighted by lines spoken, Fig. 3b.
The outcome, evident from Fig. 3c, Is far from what we had initially.
Thus, even for just one semantic mapping, And R. and J. as a specific case: We see at least six decent transformations, Statistics differing tremendously.
Hyp. So is this all?
Cre. Oh, that is but the start! Thus far, we've had just characters as nodes.
One possible complaint with this approach Is that it gives us artificial cliques.
Instead, we could in our semantic mapping Consider also parts of plays as nodes, Transforming plays into bipartite graphs, Whose edges signal character occurrence.
Then granularity, Fig. 4a-b, Concerns the nodes, but sometimes also edges.
In terms of expressivity, we could Again attend to weights, and represent Directionality, see Fig. 4c, With greater ease than in the one-mode case, To model single speech acts, too, as edges.
Hyp. Now, that is quite a lot; so are you finished?
Cre. Respectfully, the best is yet to come! Conceptually, all I have just described Can be derived from a more general model. Hyp. Then what remains? Cre. A set system, a hypergraph, they say [4],

All graphs, regarding expressivity
We visualize its power in Fig. 6.

Confusingly: All graphs are hypergraphs
But not vice versa.

Hyp. Do we need this, GRAPH?
Gra. Well, some found hypergraphs to be quite handy To capture higher-order interactions [1,2,3].

They certainly are more intuitive
Than making cliques of multi-arities, Or else treating relations, too, as nodes.
Cre. We can go far with graphs but don't know yet Just how much further we can get with hyper.
Observe the beauty in these hypergraphs: They readily entail all transformations! From their perspective, what first we discussed Are clique expansions, and our next ideas Are known as star expansions [17]; see, in sum, Col. Those fecund thoughts shall find their fertile soil. They empty the content of the jar onto the bonsai.
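The clique and star expansions just named can be illustrated generically for a single hyperedge; the sketch below is an assumption-laden illustration in Python (character and scene names are invented, and this is not the released pipeline code):

```python
from itertools import combinations

# A hyperedge: the set of characters on stage together in one scene
# (names are illustrative, not taken from the dataset files).
hyperedge = {"Romeo", "Juliet", "Nurse"}
scene_id = "act1_scene5"

# Clique expansion: connect every pair of co-occurring characters.
clique_edges = {tuple(sorted(pair)) for pair in combinations(hyperedge, 2)}

# Star expansion: make the scene itself a node and connect each
# character to it, yielding a bipartite graph.
star_edges = {(character, scene_id) for character in hyperedge}

print(sorted(clique_edges))  # 3 character-character edges
print(sorted(star_edges))    # 3 character-scene edges
```

Note how the clique expansion produces the "artificial cliques" criticized above, while the star expansion keeps the scene explicit as its own node.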
Col. To ashes, ashes; dust to dust. Not me: Thus goes the system, let the system be.
Exit. GRAPH caresses the bonsai. Gra. Full many a transformation have I seen Flatter the flora with their sovereign hand, And sovereign's hand in spirit I'll have been

A.1 Datasheet
Our documentation follows the Datasheets for Datasets framework [7], omitting the questions referring specifically to data related to people.¹ For conciseness, unless otherwise indicated, the term graph refers to both graphs and hypergraphs.

A.1.1 Motivation
For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.
HYPERBARD was created to study the effects of modeling choices in the graph data curation process on the outputs produced by graph learning, graph mining, and network analysis algorithms.
There was no specific task in mind; rather, all classic graph learning, graph mining, and network analysis tasks were considered to be in scope. These tasks include, e.g., centrality ranking, outlier detection, clustering, similarity assessment, and standard statistical summarization, each for nodes, edges, and graphs, as well as variants of node classification, link prediction, or graph classification.
HYPERBARD was designed to fill a specific gap: Although there were myriad freely available graph datasets, to the best of our knowledge, none of them contained
- several different relational data representations,
- of the same underlying raw data,
- derived in a principled and well-documented manner,
- from each of several raw data instances belonging to a natural collection,
- where the raw data is intuitive and interpretable.
Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)?
Corinna Coupette and Bastian Rieck created the dataset as part of their research.
Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.
The creation of the dataset was indirectly funded by the institutions employing the dataset authors, i.e., the Max Planck Institute for Informatics (Corinna Coupette) and the Institute of AI for Health, Helmholtz Munich (Bastian Rieck). There are no associated grants.
Any other comments? None.

A.1.2 Composition
What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Are there multiple types of instances (e.g., movies, users, and ratings; people and interactions between them; nodes and edges)? Please provide a description.
Each instance represents a play attributed to William Shakespeare as a graph, and there are multiple different graph representations per play. In some graphs (i.e., hypergraphs and graphs derived from clique expansions of hypergraphs), nodes represent characters, and (hyper)edges represent that characters were on stage at the same time in some part of the play. In other graphs (i.e., graphs derived from star expansions of hypergraphs), nodes represent characters or parts of a play, and an edge indicates that a character was on stage in that part of the play. The representations provided differ not only in their semantic mapping (what are the nodes and edges) but also in their granularity (what parts of the play are modeled as edges resp. nodes) and in their expressivity (what additional information is associated with nodes and edges); see Table 1 in the HYPERBARD paper.
How many instances are there in total (of each type, if appropriate)?
There are 37 plays in the raw data: 17 comedies, 10 historical plays, and 10 tragedies. Each play is represented as a graph in (at least) 18 different ways, for a total of 666 graph representations.
Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (e.g., to cover a more diverse range of instances, because instances were withheld or unavailable).
The dataset contains graph representations of all plays attributed to William Shakespeare by the Folger Shakespeare Library (see https://folgerpedia.folger.edu/William_Shakespeare%27s_plays), with the exception of lost plays and the comedy The Two Noble Kinsmen, a collaboration between Shakespeare and John Fletcher that is not currently provided in the TEI Simple format by Folger Digital Texts.
What data does each instance consist of? "Raw" data (e.g., unprocessed text or images) or features? In either case, please provide a description.
Each instance, i.e., each of Shakespeare's plays, is represented by a set of files: one raw data file containing the text of the play as an XML file encoded using the TEI Simple format, taken from Folger Digital Texts without modification, three CSV files containing preprocessed data, and 19 CSV files containing node lists and edge lists to construct different graph representations.
Consequently, the dataset is distributed using the following folder structure:
- rawdata: contains 37 raw data XML files encoded in TEI Simple.
- data: contains 3·37 preprocessed data files derived from the files in rawdata.
- graphdata: contains 19·37 node and edge lists to construct graph representations from the files in data.
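Under these conventions, the per-play file locations and overall counts can be sketched as follows; the helper below is a hypothetical illustration of the naming scheme described in this documentation (the play slug is invented), not part of the released code:

```python
from pathlib import Path

def expected_raw_path(play: str, root: str = ".") -> Path:
    """Path of the raw TEI Simple XML file for one play, following the
    {play}_TEIsimple_FolgerShakespeare.xml naming pattern."""
    return Path(root) / "rawdata" / f"{play}_TEIsimple_FolgerShakespeare.xml"

# 37 plays, each with at least 18 graph representations.
n_plays, n_representations = 37, 18
total = n_plays * n_representations
print(expected_raw_path("romeo-and-juliet"))
print(total)  # 666 graph representations in total
```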
Is there a label or target associated with each instance? If so, please provide a description.
There are labels corresponding to the type of play (one of {comedy, history, tragedy}), which could be used to partition the data for exploration, or as targets in classification tasks.
Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text.
There is no missing information.
Are relationships between individual instances made explicit (e.g., users' movie ratings, social network links)? If so, please describe how these relationships are made explicit.
When considering plays as instances, no relationships between individual instances are made explicit. When considering characters or parts of plays as instances, however, relationships between characters, or between characters and parts of plays, are made explicit in the graph representations, exploiting the TEI Simple encoding of that data and the annotations provided in the XML attributes.
Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them.
There are no recommended data splits for the current release.
Are there any errors, sources of noise, or redundancies in the dataset? If so, please provide a description.
The raw data contain some errors and redundancies in the XML encoding. Errors include redundant XML tags (e.g., doubly-wrapped <div> tags), but also character entries or exits not explicitly annotated. Redundancies result from the choice, made by the creators of Folger Digital Texts, to encode some information conveyed in the raw text also as attributes or separate XML tags (e.g., a character who speaks is encoded both as an attribute of the tag wrapping the speech and as an XML tag wrapping the name of the speaker).
There are two notable sources of noise affecting the preprocessed data and the graph data, both of which relate to our handling of stage directions, i.e., our processing of the XML attributes of <stage> tags in the raw data. First, to determine which characters are on stage when a word is spoken, we primarily rely on the contents of who attributes in the <stage> tags of the raw data marked with type="entry" resp. type="exit". The who attributes, however, are sometimes semantically incomplete, i.e., they may reflect Shakespeare's original stage directions accurately, but the original stage directions do not mention implied character movements (such as the exit of a side character or the exit of characters that died or fell unconscious at the end of a scene). To limit the impact of this noise source on our graph representations, we "flush" characters when a new scene starts (to handle missing exits) and ensure that the speaker is always on stage (to handle missing entries, some of which are also introduced by our character flushing policy).
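The two repair rules just described (flushing at scene boundaries and forcing speakers onto the stage) can be sketched as follows; this is a simplified, hypothetical illustration of the policy on invented events, not the released preprocessing code:

```python
def track_on_stage(events):
    """Replay (kind, character) events, where kind is one of 'scene',
    'entry', 'exit', or 'speak'; return who is on stage at each 'speak'."""
    on_stage = set()
    heard_by = []
    for kind, character in events:
        if kind == "scene":        # flush at scene start: handles missing exits
            on_stage = set()
        elif kind == "entry":
            on_stage.add(character)
        elif kind == "exit":
            on_stage.discard(character)
        elif kind == "speak":      # force the speaker on stage:
            on_stage.add(character)  # handles missing entries
            heard_by.append(set(on_stage))
    return heard_by

events = [
    ("scene", None), ("entry", "Romeo"), ("speak", "Romeo"),
    ("speak", "Juliet"),                  # missing entry, repaired on the fly
    ("scene", None), ("speak", "Nurse"),  # Romeo and Juliet flushed
]
print(track_on_stage(events))
```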
Second, in our directed graph representations, where edges encode speaking and being spoken to, we equate being on stage while a word is spoken with hearing the word. Thus, we do not account for the impact of some stage directions concerning delivery, e.g., stage directions indicating that speech is inaudible for some or all other characters on stage, on the information flow our directed graph representations purport to capture. In the TEI Simple encoding of our raw data, such stage directions are annotated with type="delivery", but there is no indication in the XML annotations of who can hear the words so delivered. There are 2,200 XML tags annotated with type="delivery" (i.e., 60 delivery modifications per play on average). As modifications to delivery are sometimes crucial to drive the plot (e.g., by setting up misunderstandings), the impact of this noise source should not be underestimated; but it affects only our directed graph representations, which might be cautiously interpreted as "upper bounds" on the information flow between the characters on stage.
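Under the "on stage while spoken equals heard" assumption, weighted directed edges can be derived as sketched below; names and word counts are invented, and the function is an illustrative stand-in for the actual pipeline:

```python
from collections import Counter

def directed_speech_edges(speech_acts):
    """speech_acts: iterable of (speaker, on_stage, n_words) triples.
    Returns a Counter mapping (speaker, listener) to words heard."""
    edges = Counter()
    for speaker, on_stage, n_words in speech_acts:
        for listener in on_stage:
            if listener != speaker:      # no self-loops for hearing oneself
                edges[(speaker, listener)] += n_words
    return edges

acts = [
    ("Romeo", {"Romeo", "Juliet", "Nurse"}, 12),
    ("Juliet", {"Romeo", "Juliet"}, 7),
]
print(directed_speech_edges(acts))
```

Because inaudible delivery is ignored, such edge weights overestimate what was actually heard, matching the "upper bound" interpretation above.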
The sources of noise detailed above could likely be eliminated, to a large extent, by a more sophisticated parsing of the stage directions. This parsing could leverage, e.g., natural language processing methods to supplement the XML annotations. We plan to implement this improvement in a future dataset release.
Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? If it links to or relies on external resources, a) are there guarantees that they will exist, and remain constant, over time; b) are there official archival versions of the complete dataset (i.e., including the external resources as they existed at the time the dataset was created); c) are there any restrictions (e.g., licenses, fees) associated with any of the external resources that might apply to a dataset consumer? Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points, as appropriate.
The dataset is self-contained. The raw data stem from Folger Digital Texts, maintained by the Folger Shakespeare Library and released under the CC BY-NC 3.0 Unported license, and they are redistributed without modifications as part of the HYPERBARD dataset. All other data are derived from the raw data, and the CC BY-NC 3.0 Unported license does not impose any additional restrictions.
As part of our dataset maintenance (see below), we will regularly check Folger Digital Texts for modifications, and we will recompute and redistribute an updated HYPERBARD dataset under a versioned DOI whenever we detect changes.
Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals' non-public communications)? If so, please provide a description.
The dataset does not contain data that might be considered confidential.
Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why.
The raw data, i.e., Shakespeare's plays, contain scenes that might be considered offensive, insulting, threatening, or otherwise anxiety-inducing from a contemporary perspective. For example, there is considerable controversy in the humanities around whether The Taming of the Shrew is misogynistic, and the main female protagonist's final speech on female submissiveness (Act V, Scene 2, ll. 136-179) might cause discomfort to modern readers. Moreover, the corpus uses words that might be considered derogatory or offensive from a contemporary perspective. The preprocessed data, however, disassembles the original text, such that (offensive) play content is no longer immediately apparent when the data is viewed directly.

Any other comments?
The entire dataset takes up roughly 365 MB when uncompressed, and 30 MB when compressed.

A.1.3 Collection Process
How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part-of-speech tags, model-based guesses for age or language)?
The raw data associated with each instance was acquired from Folger Digital Texts as XML files encoded in the TEI Simple format. This format contains both raw text and structural, linguistic, and semantic annotations embedded in XML tags or XML attributes. Hence, it was partially directly observable (e.g., the raw text and its structure) and partially derived from other data (e.g., the XML tags and their attributes). The preprocessed data and the graph data were derived from the raw data.
If the data was reported by subjects or indirectly inferred/derived from other data, was the data validated/verified? If so, please describe how.
To the extent that the raw data were indirectly inferred or derived from other data, validation was performed by the specialists from Folger Digital Texts. The preprocessed data and the graph data were validated by unit tests and manual inspection aided by visualizations (which also led us to discover the noise sources detailed above).
What mechanisms or procedures were used to collect the data (e.g., hardware apparatuses or sensors, manual human curation, software programs, software APIs)? How were these mechanisms or procedures validated?
The raw data was bulk downloaded in TEI Simple format as a ZIP archive from the Folger Digital Texts downloads section; Folger Digital Texts compiled the raw data through computer-assisted manual curation. The bulk download was checked manually to ensure that the extracted archive contained one XML file per play, as expected. The code creating the preprocessed data from the raw data and the graph representations from the preprocessed data is almost completely unit tested.
If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)?
The data is not a sample from a larger set.
Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)?
Only Corinna Coupette and Bastian Rieck, the dataset authors, were involved in the data collection process.
Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created.
The raw data was collected through one download call to https://shakespeare.folger.edu/downloads/teisimple/shakespeares-works_TEIsimple_FolgerShakespeare.zip in June 2022, and the preprocessed data and the graph data were derived from the raw data by running a code pipeline, also in June 2022. This timeframe does not match the creation timeframe of the raw data, which, though internal to the Folger Shakespeare Library, spans at least several months in 2020. It also does not match the creation timeframe of Shakespeare's plays, which spans several decades in the 16th and 17th centuries.
Were any ethical review processes conducted (e.g., by an institutional review board)? If so, please provide a description of these review processes, including the outcomes, as well as a link or other access point to any supporting documentation.
No ethical review processes were conducted.
Any other comments? None.

A.1.4 Preprocessing/Cleaning/Labeling
All data preprocessing can be completed in a couple of minutes, even on older commodity hardware. We used a 2016 MacBook Pro with a 2.9 GHz Quad-Core Intel Core i7 processor and 16 GB RAM.

A.1.5 Uses
Has the dataset been used for any tasks already? If so, please provide a description.
In the paper introducing HYPERBARD, the dataset has been used to demonstrate the differences between rankings of characters by degree that result from different modeling choices made when transforming raw data into graphs.
Is there a repository that links to any or all papers or systems that use the dataset? If so, please provide a link or other access point.
Papers or systems known to use the dataset will be collected on https://hyperbard.net and on GitHub.
What (other) tasks could the dataset be used for?
HYPERBARD was designed for inquiries into the stability of algorithmic results under different reasonable representations of the underlying raw data, i.e., to enable representation robustness checks for graph learning, graph mining, and network analysis methods. In this role, it could generally be used for all graph learning, graph mining, and network analysis tasks identified as in scope in the motivation section.
Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? For example, is there anything that a dataset consumer might need to know to avoid uses that could result in unfair treatment of individuals or groups (e.g., stereotyping, quality of service issues) or other risks or harms (e.g., legal risks, financial harms)? If so, please provide a description. Is there anything a dataset consumer could do to mitigate these risks or harms?
The quality and expressivity of the dataset are limited by the quality and expressivity of the Folger Digital Texts encoding in the TEI Simple format, which could restrict usage in the digital humanities, e.g., for scholars interested in the minute details of character interactions described in stage directions.
HYPERBARD contains relational data representations of Shakespeare's plays, which were written more than four centuries ago. Hence, there are no risks or harms associated with the dataset beyond the risks or harms also associated with the ongoing study of Shakespeare's works in the humanities, and the risks or harms associated with the decontextualization or overinterpretation of any dataset.
At https://hyperbard.net and on GitHub, we keep a continuously-updated list of all known dataset limitations for dataset consumers to review when deciding whether HYPERBARD is appropriate for their use case.
Are there tasks for which the dataset should not be used? If so, please provide a description.
Outside representation robustness checks, HYPERBARD should not be used in tasks that have no reasonable semantic interpretation in the domain of the raw data.
Any other comments? None.

A.1.6 Distribution
Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created? If so, please provide a description.
The dataset was not created on behalf of any entity, and it will be distributed freely.
How will the dataset be distributed (e.g., tarball on website, API, GitHub)? Does the dataset have a digital object identifier (DOI)?
The dataset will be distributed as a ZIP archive via Zenodo, based on code hosted on GitHub. Each dataset version and each code release will have a versioned DOI, generated automatically by Zenodo. See also Table 2.
When will the dataset be distributed?
The dataset will be distributed when the paper introducing it is submitted.
Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? If so, please describe this license and/or ToU, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms or ToU, as well as any fees associated with these restrictions.
The dataset will be distributed under a CC BY-NC 4.0 license, according to which others are free to share, i.e., copy and redistribute, and adapt, i.e., remix, transform, and build on the material, provided they
- give attribution, i.e., give appropriate credit, provide a link to the license, and indicate if changes were made,
- do not use the material for commercial purposes, and
- add no restrictions limiting others in doing anything the license permits.
The code constructing the dataset will be distributed under a permissive BSD 3-Clause license.
Have any third parties imposed IP-based or other restrictions on the data associated with the instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms, as well as any fees associated with these restrictions.
The Folger Shakespeare Library has released the source of our raw data, Folger Digital Texts, under the CC BY-NC 3.0 Unported license, which has essentially the same usage conditions as our CC BY-NC 4.0 license.
Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any supporting documentation.
No export controls or other regulatory restrictions apply.
Any other comments? None.

A.1.7 Maintenance
Who will be supporting/hosting/maintaining the dataset?
Corinna Coupette and Bastian Rieck will be supporting, hosting, and maintaining the dataset.
How can the owner/curator/manager of the dataset be contacted (e.g., email address)?
In the interest of transparency, the preferred method to contact the dataset maintainers is by opening GitHub issues at https://github.com/hyperbard/hyperbard. Alternatively, the dataset maintainers can be reached by email to info@hyperbard.net.
Is there an erratum? If so, please provide a link or other access point.
Errata will be documented at https://hyperbard.net and on GitHub.
Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)? If so, please describe how often, by whom, and how updates will be communicated to dataset consumers (e.g., mailing list, GitHub)?
The dataset will be updated as needed, and updates will be labeled using semantic versioning.
- A patch version (e.g., 0.0.1 → 0.0.2) is a recomputation of the latest dataset version following a non-breaking change in the underlying raw data.
- A minor version (e.g., 0.0.1 → 0.1.0) is an update of the latest dataset version that increases the expressivity of existing representations while maintaining all of their previously present features.
- Any other update is a major version (e.g., 0.0.1 → 1.0.0). This includes, e.g., responses to breaking changes in the underlying source data, additions of new representations, and changes to existing representations that might break dataset consumer code.
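As a rough illustration, this update policy could be encoded in a small version-bumping function; the helper and its change labels below are hypothetical, not part of the release tooling:

```python
def bump(version: str, change: str) -> str:
    """Apply the semantic-versioning policy sketched above.
    change: 'raw-data-recomputation' (patch), 'expressivity-extension'
    (minor), anything else (major)."""
    major, minor, patch = map(int, version.split("."))
    if change == "raw-data-recomputation":
        return f"{major}.{minor}.{patch + 1}"
    if change == "expressivity-extension":
        return f"{major}.{minor + 1}.0"
    return f"{major + 1}.0.0"

print(bump("0.0.1", "raw-data-recomputation"))  # 0.0.2
print(bump("0.0.1", "expressivity-extension"))  # 0.1.0
print(bump("0.0.1", "new-representation"))      # 1.0.0
```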
Patch versions will be created automatically using GitHub actions. Minor versions and major versions will be created by the dataset maintainers, potentially accepting pull requests or implementing feature requests filed via https://github.com/hyperbard/hyperbard. New releases will be communicated at https://hyperbard.net and on GitHub, and they will be available for download under a versioned DOI on Zenodo, with 10.5281/zenodo.6627158 always resolving to the latest release.
If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were the individuals in question told that their data would be retained for a fixed period of time and then deleted)? If so, please describe these limits and explain how they will be enforced.
There are no data retention limits.
Will older versions of the dataset continue to be supported/hosted/maintained? If so, please describe how. If not, please describe how its obsolescence will be communicated to dataset consumers.
Older versions of the dataset will remain hosted on Zenodo, with the relevant version of the code needed to reproduce them available in an associated GitHub release, also archived on Zenodo.
There will be basic support for older versions of the dataset, and as HYPERBARD is derived from century-old literary works, dataset maintenance amounts to dataset updates (see the paragraph on dataset updates).
If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? If so, please provide a description. Will these contributions be validated/verified? If so, please describe how. If not, why not? Is there a process for communicating/distributing these contributions to dataset consumers? If so, please provide a description.
Others can extend, augment, build on, and contribute to the dataset through the engagement mechanisms provided by GitHub. See also https://github.com/hyperbard/hyperbard/blob/main/CONTRIBUTING.md.
Extensions, augmentations, and contributions provided via pull requests will be validated and verified by the dataset maintainers in a regular code and data review process, while changes made in independent forks will not be checked.
Contributions integrated with the HYPERBARD code repository will be visible on GitHub, and they trigger new dataset releases, in which contributors will be specifically acknowledged.

Any other comments?
None.

B Usage Documentation
The HYPERBARD dataset is distributed in four folders: rawdata, data, graphdata, and metadata. See Section A.1.2 for more details on the composition of the dataset. The dataset can be reproduced by cloning the GitHub repository and running make (this will also generate most figures included in the HYPERBARD paper).
In addition to the written documentation, we provide Jupyter notebook tutorials for interactive data exploration. The tutorials are hosted on GitHub at https://github.com/hyperbard/tutorials, and they can be run both locally and in a Binder, i.e., a fully configured remote environment accessible through the browser without any local setup. Launching the Binder usually takes around thirty seconds.
In the following, we explain the structure of the files in HYPERBARD's folders and detail how these files can be read. All file examples are taken from Romeo and Juliet, and for CSV files, all columns are described in alphabetical order.

B.1 rawdata
This folder contains XML files encoded in TEI Simple as provided by Folger Digital Texts. These files can be read with any XML parser, such as the parser from the beautifulsoup4 library in Python. All file names follow the pattern {play}_TEIsimple_FolgerShakespeare.xml.
The XML encoding is designed to meet the needs of the (digital) humanities, and hence, it is very detailed and fine-grained. For example, every word, whitespace character, and punctuation mark is contained in its own tag.
The encoding practices followed by Folger Digital Texts are described in the <encodingDesc> tag of each text. To summarize:
- The major goal of the TEI Simple encoding is to achieve interoperability with a large corpus of early modern texts derived from the Early English Books Text Creation Partnership transcriptions (i.e., it is different from our goal).
- The encoding is completely faithful to the readings, orthography, and punctuation of the source texts (i.e., the Shakespeare texts edited by Barbara Mowat and Paul Werstine at the Folger Shakespeare Library).
- All xml:ids are corpus-wide identifiers (i.e., they are unique across all our plays, too).
- Words, spaces, and punctuation characters are numbered sequentially within each play, in increments of 10 (XML attribute: n).
- Most other elements begin with an element-specific prefix, followed by a reference to the Folger Through Line Number, a sequential numbering of the numbered lines in the text. (Details omitted.)
- Spoken words are linguistically annotated with a lemma and a POS tag.
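For instance, the word-level annotations can be read with any XML parser; the minimal sketch below uses the standard library's ElementTree on a made-up TEI-like fragment (the real files are far richer, and their element names, attribute values, and namespaces may differ):

```python
import xml.etree.ElementTree as ET

# A made-up fragment imitating TEI Simple word annotation.
fragment = """
<sp who="#Romeo_RJ">
  <w n="10" lemma="but" pos="c-acp">But</w>
  <w n="20" lemma="soft" pos="j">soft</w>
</sp>
"""

root = ET.fromstring(fragment)
# Collect (sequence number, lemma, POS tag, surface form) per word tag.
words = [(w.get("n"), w.get("lemma"), w.get("pos"), w.text)
         for w in root.iter("w")]
print(root.get("who"))  # speaker reference, with a leading '#'
print(words)
```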

B.2 data
This folder contains CSV files, which can be read with any CSV parser, such as the parser from the pandas library in Python.

B.2.1 {play}.cast.csv
A {play}.cast.csv file contains the XML identifiers and attributes of all <castItem> tags found in a {play}_TEIsimple_FolgerShakespeare.xml file. It gives an overview of the characters occurring in a play, and it can be used to count the number of characters (including characters that do not speak) or to build a hierarchy of characters and character groups.
Rows correspond to characters or character groups.
Columns in alphabetical order:
- corresp: group (i.e., another cast item) to which a given cast item belongs, if any (XML attribute abbreviating "corresponds"). Type: String or NaN (if the cast item does not belong to any other cast item).
- xml:id: unique identifier of the cast member. Type: String.
Note that the data in each of these columns does not start with a # sign. This contrasts with references to the xml:ids in the attributes of other XML tags in the raw-data XML files, which do start with a # sign (to indicate the referencing). Example:
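A minimal sketch of working with a cast file in pandas is given below. It uses an assumed two-row excerpt (the xml:id and corresp values are hypothetical); a real file would be read as `pd.read_csv("data/{play}.cast.csv")` with the play name filled in. The last step illustrates the # convention just described: a reference copied from a raw-data XML attribute must have its leading # stripped before it matches the xml:id column.

```python
import io
import pandas as pd

# Assumed two-row excerpt of a {play}.cast.csv file; values are hypothetical.
csv_text = """corresp,xml:id
,CAPULETS_Rom
CAPULETS_Rom,Capulet_Rom
"""
cast = pd.read_csv(io.StringIO(csv_text))

# Number of cast items, including non-speaking characters:
print(len(cast))  # 2

# A reference taken from a raw-data XML attribute starts with '#';
# strip it before matching against the xml:id column (which does not).
xml_ref = "#Capulet_Rom"
row = cast[cast["xml:id"] == xml_ref.lstrip("#")]
print(row["corresp"].item())  # CAPULETS_Rom
```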

B.4 metadata
This folder currently contains exactly one CSV file, which maps play identifiers to play types. The file can be read with any CSV parser, such as the parser from the pandas library in Python, but since its provenance is documented as a comment at the start of the file, the # character needs to be passed to the parser as a comment character.
Rows correspond to plays.

Columns in alphabetical order:
- play_name: The name of the play, as used to fill the {play} placeholder in all play-specific file names. Type: String.
- play_type: The type of the play. One of {comedy, history, tragedy}.
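A minimal sketch of reading the metadata file follows, using an assumed one-row excerpt (the comment line and the play identifier are illustrative, not the real file contents). It shows why the # comment character must be passed to the parser: without `comment="#"`, the provenance comment would be misread as data.

```python
import io
import pandas as pd

# Assumed excerpt of the metadata file; the real file documents its
# provenance in the leading comment.
csv_text = """# provenance: (documented here in the real file)
play_name,play_type
romeo-and-juliet,tragedy
"""
meta = pd.read_csv(io.StringIO(csv_text), comment="#")
print(meta.loc[0, "play_type"])  # tragedy
```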

Force '∈ {1, 2}' on cardinality
Of edges-
Hyp. Marvelous mathematically!
Cre. But artificial, thinking critically.
The interactions in your vivid woods-
How many of them are bilateral?
This common cardinality constraint:
Let's do away with it!
Hyp.

Fig. 5, and our proposals in Tab. 1.
Hyp. Things hyper, in their generality,
They seem to suit my woods quite naturally.
Gra. But sovereign, as a practicality,
There's hardly any software letting us
Compute with hypergraphs conveniently!
Hyp. and Cre. [in synchrony] Who are you, the Community?
Gra. I'm sorry.
Exeunt.

Figure 4: Weighted bipartite graph of named character occurrences in Act III of Romeo and Juliet, resolved at the scene level (a) and at the stage group level (b), as well as the directed weighted bipartite graph resolved at the speech act level, with character nodes split up into speakers and listeners for visual clarity (c), where we highlight the protagonists appearing in Act III, Scene V. While the coarse-grained representation overestimates Romeo's role in Act III, Scene V (a), the finer-grained representation again highlights Juliet's bond with the Nurse (b), and the directed representation reveals the hierarchical structure of their communication (c).

Figure 7: Spearman correlations of degree rankings in the clique and star expansions from Tab. 1 for Romeo and Juliet (bottom), and residuals after subtracting the average correlations in the HYPERBARD corpus (top).

Cre. Did you discuss the problem with the data?
Hyp. I laid it out for them, to no avail.
Col. You surely got me thinking, but-
Prof. Enough! My patience is exhausted. Think? Produce!
[To Col.] You, give productive treatment to that thinker.
Exit COLLEAGUE with HYPERBARD.
[To Cre.] And you, fix these few figures; faugh R2.
Exeunt.
ACT V. SCENE I.-The Community. COLLEAGUE's Office.
GRAPH, invisible, floating by the window. Enter COLLEAGUE, carrying a jar.

Figure 8: Named characters in Romeo and Juliet, ranked by their degree in the clique expansion (ce) and star expansion (se) representations from Tab. 1. We omit the se-speech-mwd representation because its ranking is equivalent to that of the se-speech-wd representation by construction. While Romeo is ranked first under all representations, the rankings differ, inter alia, in the prominence assessment of side characters, such as the Nurse or Friar Lawrence.
The Community. In the dining hall. PROFESSOR, SENIOR RESEARCHER, and COLLEAGUE seated at a table. Enter CREATURE, carrying a tray.
SCENE II.-CREATURE's office. In a corner, on the floor, CREATURE, in contemplation.

Table 1: Overview of relational data representations provided with HYPERBARD for each play attributed to William Shakespeare, based on the TEI Simple-encoded XMLs provided by Folger Digital Texts [13]. Unidirectional arrows indicate assignment; bidirectional arrows indicate bijection. We highlight the transformations most commonly used in the literature.

Exit GRAPH. HYPERBARD settles by the office plant.
SCENE II.-PROFESSOR's Office. Enter PROFESSOR and CREATURE.
Prof. The judgment's in, you have no time to spare:
That future work in the Community
May operate with more representations!
Enter PROFESSOR.
Prof. What's all this noise? The rules! No visitations!
Cre. Let me explain-
Prof. Save me your explanations!
I want you in my office, now! And when
We're done, this dirty stray thing must be gone!
Exeunt PROFESSOR and CREATURE.
Gra. Your honor, I foresaw this would be dangerous.
Hyp. You see their wielding of authority?
So far up in the hierarchy, so long,
And funeral their only honest feedback.
I'm not afraid, but let us maybe make
Our data case not at the top to start with.
Gra. When floating down the hall I think I saw
The perfect target for us to attack.
Hyp. What's with this war rhetoric?
Gra. I'll be back.
They hand CREATURE a sheet of paper.
Prof. Accept, well done, but now the camera-ready's near.
Cre. They're taking months, and now we're given days?
Additional experiments? But how?
No space! What should I do about R2?
Prof. That's up to you-it will not change a thing.
Cre. [Aside] That's comforting.