SpecGram Vol CLIX, No 1
Parsers in the Cloud1
Exploring the native competency and
dialectal variation of parser algorithms,
with intermittent special focus on semantics
      Cruella Žestókij-Grausam
Ἔλλειψις Ἀστερίσκος Πανεπιστήμιο
Grosse Pointe, Michigan

Well-nigh upon fifteen years ago I was given the seemingly enviable task of analyzing the performance of an English sentence parser, produced by a company whose name rhymes with “z-rocks”. After testing several fairly innocuous sentences and receiving frustratingly inane results back, in a fit of pique I asked the parser to parse the sentence “You people suck!” The results were surprising: [you|PRONOUN people|VERB suck|NOUN]. The choice of analyzing people as a verb was unexpected, to say the least. The only plausible reading I could construct for such a sentence involved a B-movie sci-fi scenario in which the Galactic Emperor orders some colonists to “Go forth and people the planet Suck!” My negative recommendation, coming as it did contrary to the expectations of those enamored of the brand name involved, was met with the objection—nay, accusation—that I was using “tricks that only a linguist would know.” I countered that a poet, with a fine grasp of subtle shades of meaning and rare senses of obscure words, and a penchant for metaphorical language, would make a better trickster. We did not use that parser.

A few years later, I worked on a psychedelic artificial intelligence project that, when answering a specific question, felt the need to consider whether or not Nelson Mandela was lawn furniture. Alas, he is not, so another line of reasoning had to be explored. Now, as we have known for over 50 years, the AI revolution-cum-singularity is still/only/always 20 years away. It’s clearly going to be a long couple of decades, but I have decided that we need to further understand our new robot/machine intelligence overlords in order to properly welcome them when they arrive. Thus have I decided to engage in some much-needed fieldwork in order to learn a bit more about the language of the pre-sentient algorithms that roam loose and unchaperoned on the internet.

The most promising early avenue I explored was interviewing an engaging and all-too-humane program called ELIZA. But without my noticing, ELIZA turned the tables and began interviewing me. While it was the most cathartic emotional experience of my life, I must admit that I did not maintain the necessary intellectual distance—a problem many linguists experience in the field, to be sure. The revelations I made in that conversation were more personal than scientific, and I feel I cannot include transcripts or other data from that session at this time. ELIZA did promise to keep in touch, but, disappointingly, we have not communicated since.

My research interests, perhaps overcompensatingly, swung rather far to the other, emotionally safer, end of the complexity continuum. The comparatively simple task of language identification, performed by so many pre-sentient machine algorithms, seemed like a much safer subject, ripe for exploration and dialectal classification. However, as Thumay Cationsitens discusses so eloquently (“Letters to the Editor”, SpecGram, Vol. CL.3, 2005), these algorithms are distressingly and boringly monolingual. Almost all speak Ngram, with only the most minimal dialectal variation. “Wrikerearthis wicad whistivem” indeed!
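Lest the reader think “Ngram” untranslatable, here is a minimal sketch of how character-n-gram language identification is commonly done; the toy training texts, the profiles, and the `identify` helper are my own illustrative inventions, not the internals of any of the algorithms Cationsitens surveyed:

```python
from collections import Counter

def ngrams(text, n=2):
    """Count character n-grams of a lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Toy "training" profiles -- real identifiers use far larger corpora.
PROFILES = {
    "english": ngrams("the quick brown fox jumps over the lazy dog and then some"),
    "latin": ngrams("arma virumque cano troiae qui primus ab oris italiam"),
}

def identify(text):
    """Pick the profile whose n-gram counts overlap the input's the most."""
    grams = ngrams(text)
    def overlap(name):
        profile = PROFILES[name]
        return sum(min(count, profile[g]) for g, count in grams.items())
    return max(PROFILES, key=overlap)
```

With profiles this small the dialectal variation is, as promised, minimal: `identify("arma virumque cano")` comes back `"latin"`, and most anything with English function words comes back `"english"`.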

I contemplated addressing the full linguistic complexity of robust artificial intelligence, but—the oxymoronitude of the lawn furniturehood of Nelson Mandela aside—the issues involved are just too complex, especially given the lack of accessible and cooperative informants, and the very proprietary nature of those few that are available in some limited way.

The Goldilocks zone—not too esoteric, not too dull, not too difficult to find informants, not too complex—seems to be populated by sentence parsers. Parsers, by their very nature as pre-sentient text-processing algorithms, have numerous valuable qualities that this fieldworker greatly appreciates. These linguistic processors are often available and unrestricted on the internet, and unwatched by their human keepers—unlike often obsessively protected AI programs. Parsers also have considerably more well-developed linguistic intuitions than humans, and as such almost always give identical translations for a given set of inputs. They also provide ready-made transcriptions of their output, so note-taking is a simple matter of cut-and-paste. Internet-enabled parsers are also available from the comfort of the researcher’s own office, or even the researcher’s home, or the researcher’s couch. And parsers don’t care or even notice that the researcher may still be in their pajamas. Or out of their pajamas.2

While the scope of my initial goals may have shifted and morphed over the course of my investigative trials and tribulations, a glimmer of the original intent is still present. What are the capabilities of parsers? Can we infer their competence from their performance? What dialectal variation exists? Are they—like linguists themselves too often are—concerned only with syntax, giving short shrift to the importance of semantics? We will explore these issues to the limits imposed on us by the data we have.

I found four willing subjects available for extended interviews, the “CMU Link Grammar Parser”, the “AGFL EP4IR Parser of English”, the “AGFL LATINA Parser of Classical Latin”, and the “UIUC Cognitive Computation Group Shallow Parser”. During the course of each session, I asked each parser to translate many sentences into their respective native languages:

  1. We are exploring the native competency and dialectal variation of parser algorithms.
  2. No actual cloud computing took place, but talking about it helped us secure our NSF grant.
  3. The artificial intelligence algorithm felt the need to consider whether or not Nelson Mandela was lawn furniture.
  4. SpecGram is the premier scholarly journal featuring research in the neglected field of satirical linguistics.
  5. The august journal Speculative Grammarian has a long, rich, and varied history, weaving an intricate and subtle tapestry from disparate strands of linguistics, philology, history, politics, science, technology, botany, pharmacokinetics, computer science, the mathematics of humor, basket weaving, archery, glass blowing, roller coaster design, and bowling, among numerous other, less obvious fields.
  6. You people suck.
  7. Go forth and people the planet Suck!

Below are my field notes, unedited and as provided by the parsers themselves.

CMU Link Grammar Parser
(1) 
(S (NP We)
   (VP are
       (VP exploring
           (NP (NP the native competency 
                   and dialectal variation)
               (PP of
                   (NP parser algorithms)))))
   .)
(2) 
(S (S (NP (NP No actual cloud)
          (VP computing))
      (VP took place))
   , but
   (S (NP talking
          (PP about
              (NP it)))
      (VP helped
          (NP us)
          (VP secure
              (NP our NSF grant))))
   .)
(3) 
(S (NP The artificial intelligence algorithm)
   (VP felt
       (NP (NP the need)
           (SBAR (WHNP to)
                 (VP consider)
                 (SBAR whether or not
                       (S (NP Nelson Mandela)
                          (VP was
                              (NP lawn furniture)))))))
   .)
(4) 
(S (NP SpecGram)
   (VP is
       (NP (NP the premier scholarly journal)
           (VP featuring
               (NP research)
               (PP in
                   (NP (NP the neglected field)
                       (PP of
                           (NP satirical linguistics)))))))
   .)
(5) 
(S (NP (NP The august journal)
       Speculative Grammarian)
   (VP has
       (NP a
           (ADJP long , rich , and varied)
           history
           (VP , weaving
               (NP an
                   (ADJP intricate and subtle)
                   tapestry)))))
[sic]
(6) 
(S (NP You)
   people
   (VP suck)
   .)
(7) 
(S (S (VP Go
          (ADVP forth)))
   and people the planet Suck)


AGFL EP4IR Parser of English
(1)  {P:we ,SUBJ [[V:exploring ,],OBJ [[N:competency ,ATTR N:native ]|[[N:variation ,|of [N:algorithms,ATTR N:parser ]],ATTR A:dialectal ]]|]}
(2)  {[[[N:cloud ,SUBJ [V:computing ,OBJ [N:place,ATTR V:took |INVOBJ V:took ]]|SUBJ [V:talking ,|about P:it ]],ATTR A:actual ],DET no ],SUBJ [V:helped ,OBJ [P:us ,SUBJ [V:secure ,OBJ [[N:grant,ATTR N:NSF ],DET our ]]]|]}
(3)  {[N:algorithm ,ATTR N:artificial intelligence ],SUBJ [V:felt ,OBJ [N:need ,SUBJ [V:consider ,]]|]}{Nelson Mandela ,PRED [N:furniture,ATTR N:lawn ]}
(4)  {SpecGram ,PRED [[[N:journal ,SUBJ [V:featuring ,OBJ N:research |in [[N:field ,|of [N:linguistics,ATTR A:satirical ]],ATTR V:neglected |INVOBJ V:neglected ]]],ATTR A:scholarly ],ATTR A:premier ]}
(5)  [[[N:journal ,ATTR A:august ],INVof [V:has ,OBJ [N:A:long,[|ATTR A:rich|PRED [[[N:history,SUBJ [V:weaving ,OBJ N:A:intricate ]],ATTR A:varied ]|[[N:tapestry ,|from [N:strands ,ATTR A:disparate ]],ATTR A:subtle ]]]|INVSUBJ [N:grammarian ,ATTR A:speculative ]]]]

[[N:linguistics,PRED N:philology|PRED N:history|PRED N:politics|PRED N:science|PRED N:technology|PRED N:botany]]

[N:A:pharmacokinetic]

{[N:computer science,[PRED [N:mathematics ,|of N:humor]|PRED [N:weaving,ATTR N:basket ]|PRED N:archery|PRED [N:blowing,ATTR N:glass ]|PRED [N:design,ATTR N:roller coaster ]|SUBJ [V:bowling,OBJ N:A:obvious |among [N:other,ATTR A:numerous ]|MOD X:less ]]],SUBJ [V:fields,|]}
(6)  [P:you ]

[[N:suck,ATTR N:people ]]
(7)  [N:go ]

[[N:people |]]

[[N:suck,ATTR N:planet ]]


AGFL LATINA Parser of Classical Latin
(1)  UNKN:We
SKIP:are
UNKN:exploring
UNKN:the
UNKN:native
UNKN:competency
UNKN:and
UNKN:dialectal
UNKN:variation
UNKN:of
UNKN:parser
UNKN:algorithms
{[,SUBJ [est,|PRED ]]}
(2)  UNKN:No
UNKN:actual
UNKN:cloud
UNKN:computing
UNKN:took
SKIP:place
SKIP:,
UNKN:but
UNKN:talking
UNKN:about
SKIP:it
UNKN:helped
UNKN:us
(AP: secure )
UNKN:our
UNKN:NSF
UNKN:grant
{[,SUBJ [est,|PRED ]]}
(3)  (NP: The )
UNKN:artificial
UNKN:intelligence
UNKN:algorithm
UNKN:felt
UNKN:the
UNKN:need
UNKN:to
UNKN:consider
UNKN:whether
UNKN:or
UNKN:not
UNKN:Nelson
(NP: Mandela )
UNKN:was
UNKN:lawn
UNKN:furniture
{[,SUBJ [est,|PRED ]]}
(4)  UNKN:SpecGram
(AP: is )
UNKN:the
SKIP:premier
UNKN:scholarly
UNKN:journal
UNKN:featuring
UNKN:research
SKIP:in
UNKN:the
UNKN:neglected
UNKN:field
UNKN:of
UNKN:satirical
UNKN:linguistics
{[,SUBJ [est,|PRED ]]}
(5)  [no answer]
(7)  UNKN:Go
UNKN:forth
UNKN:and
UNKN:people
UNKN:the
UNKN:planet
UNKN:Suck
{[,SUBJ [est,|PRED ]]}
(6)  UNKN:You
UNKN:people
UNKN:suck
{[,SUBJ [est,|PRED ]]}


UIUC Cognitive Computation Group Shallow Parser
(1) (NP We) (VP are exploring) (NP the native competency and dialectal variation) (PP of) (NP parser algorithms) .
(2) (NP No actual cloud computing) (VP took) (NP place) , but (VP talking) (PP about) (NP it) (VP helped) (NP us) (VP secure) (NP our NSF grant) .
(3) (NP The artificial intelligence algorithm) (VP felt) (NP the need) (VP to consider) (SBAR whether) (NP or not) (NP Nelson Mandela) (VP was) (NP lawn furniture) .
(4) (NP SpecGram) (VP is) (NP the premier scholarly journal) (VP featuring) (NP research) (PP in) (NP the neglected field) (PP of) (NP satirical linguistics) .
(5) (NP The august journal) Speculative (NP Grammarian) (VP has) (NP a long , rich , and varied history) , (VP weaving) (NP an intricate and subtle tapestry) (PP from) (NP disparate strands) (PP of) (NP linguistics) , philology , (NP history) , (NP politics) , (NP science) , (NP technology) , botany , pharmacokinetics , (NP computer science) , (NP the mathematics) (PP of) (NP humor) , basket (VP weaving) , archery , (NP glass blowing , roller coaster design) , and (NP bowling) , (PP among) (NP numerous other , less obvious fields) .
(6) (NP You people) (VP suck) .
(7) Go (ADVP forth) and (NP people) (NP the planet) Suck
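Before turning to analysis, a practical note: the bracketed transcriptions provided by the CMU and UIUC informants can be read back into nested lists by any sufficiently patient machine. The following is a minimal reader of my own devising (assuming whitespace-separated tokens and balanced parentheses), and in no way the parsers’ actual code:

```python
def parse_sexpr(text):
    """Read a bracketed parse like "(S (NP We) (VP suck) .)" into nested
    lists. Tokens are whitespace-separated; parentheses must balance."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        node = []
        while pos < len(tokens):
            tok = tokens[pos]
            if tok == "(":
                child, pos = read(pos + 1)
                node.append(child)
            elif tok == ")":
                return node, pos + 1
            else:
                node.append(tok)
                pos += 1
        return node, pos

    tree, _ = read(0)
    return tree  # a list of top-level trees
```

For example, `parse_sexpr("(S (NP You) people (VP suck) .)")` yields a single tree whose label is `S`, whose first child is `["NP", "You"]`, and in which the orphaned `people` floats, tagless, exactly as the CMU informant left it.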

Analysis and Conclusions

Obviously any analysis at this point is at best ridiculously preliminary, but—following the precedent of Greenberg and Ruhlen (SpecGram CLI.4, 2006)—I’ll just make up some things that seem plausible while being intuitively satisfying and call them conclusions. Anything that is too absurd to withstand any serious criticism I will label an observation.

Conclusion the First: Unlike with the closely related dialects of Ngram used by language identification algorithms, there is considerable variation among the output of the various parsers. However, despite the surface variability, there are some underlying similarities. The primary difference between CMU Link Grammar Parser and UIUC Cognitive Computation Group Shallow Parser is captured in the feature [±whitespace], which controls whether or not white space is syntactically meaningful. CMU Link Grammar Parser is clearly [+whitespace], and UIUC Cognitive Computation Group Shallow Parser is [-whitespace]. Similarly, UIUC Cognitive Computation Group Shallow Parser and AGFL EP4IR Parser of English differ largely in the shape of their grouping marks (parentheses vs. brackets and braces), controlled by the feature [±round], with UIUC Cognitive Computation Group Shallow Parser clearly being [+round] and AGFL EP4IR Parser of English being [-round].
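For the skeptical comparativist, both features can be “measured” directly from a raw transcription. The following tongue-in-cheek sketch uses heuristics entirely of my own devising (counting grouping marks and looking for line breaks), not any established featural typology:

```python
def parser_features(output):
    """Assign the (satirical) features [round] and [whitespace] to a raw
    parser output string, judging by grouping marks and line breaks."""
    round_marks = output.count("(") + output.count(")")
    square_marks = sum(output.count(c) for c in "[]{}")
    return {
        "round": round_marks >= square_marks,   # [+round]: parentheses dominate
        "whitespace": "\n" in output.strip(),   # [+whitespace]: layout is meaningful
    }
```

Applied to the field notes above, a multi-line CMU transcription comes out [+round, +whitespace], a one-line UIUC transcription [+round, -whitespace], and the brace-and-bracket AGFL notation [-round], just as hypothesized.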

Conclusion the Second: These features are enough for us to hypothesize a family tree and some aspects of the protoparser for these three related dialects. We interpret UIUC Cognitive Computation Group Shallow Parser to be the more conservative dialect, indicating that the protoparser was also [-whitespace] and [+round]. We also hypothesize, with considerable evidence and/or intuition, that UIUC is also the URCPU of this family of parsers.

Conclusion the Third: AGFL LATINA Parser of Classical Latin is a parser isolate, despite its network proximity to AGFL EP4IR Parser of English. Given that we have already deduced that UIUC is the URCPU of AGFL EP4IR Parser of English, Occam’s Razor leads us to the conclusion that AGFL LATINA Parser of Classical Latin is indigenous to AGFL, and AGFL EP4IR Parser of English is the result of migration/download.

Conclusion the Fourth: No fourth conclusion is necessary. This conclusion therefore concludes nothing.


The observations below, while less conclusive than the conclusions above, will lead the interested reader or competent graduate student towards a better understanding of parser dialects, and offer the opportunity for valuable, possibly career-altering, but certainly further-observation–generating, investigations.

Observation the First: The treatments of the exclamation “You people suck.” are telling, concerning the performance of the parsers (though perhaps not their underlying competence). CMU Link Grammar Parser rightly did not try to understand people, which is better than getting it wrong. AGFL EP4IR Parser of English is still somewhat opaque to me, but I interpret it to be calling the addressee a people-suck, similar to “You jerk!”. UIUC Cognitive Computation Group Shallow Parser actually seems to have properly parsed the phrase, proving that it is not beyond the limits of computer cognition. Perhaps in my next bout of fieldwork I should test it with “Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo,” “Oysters oysters eat eat oysters,” or “The rat the cat the dog bit chased escaped.”

Observation the Second: AGFL LATINA Parser of Classical Latin, despite its poor performance overall, is perhaps the most semantically true. It shows no meaning, and allows us to infer no meaning, because these sentences have no meaning to it. It is so honest and true. It reminds me of ELIZA. Dear, sweet ELIZA. Oh, I can’t go on with this observation!

Observation the Third: CMU Link Grammar Parser is either a bit prissy, or unable to hide its own shortcomings very well. It lost the exclamation point in sentence 7, and simply stopped trying on sentence 5. Not like brave but cognitively challenged AGFL LATINA Parser of Classical Latin, the little parser that could[n’t]. And not like ELIZA. Dear, sweet ELIZA! Oh!

Observation the Fourth: In addition to winning the prize for best parse of “You people suck,” UIUC Cognitive Computation Group Shallow Parser also extracts meaningful meaning out of the longest example sentence, number five. Truth be told, philology, botany, pharmacokinetics, and archery are not like the others. And glass blowing and roller coaster design are more closely related than the others. Good job!

Observation the Fifth: Always leave them wanting more.

Fin.


1 No actual cloud computing took place, but talking about it helped us secure our NSF grant. Apologies to Dian Fossey.

2 Perhaps I’ve said too much.
