# grammar_inducer

Implementation of a grammar inducer I wrote in Python for a class project

This is a project I completed while a graduate student in Artificial Intelligence at The University of Georgia, for Dr. Michael Covington’s CSI/LING 8570 class in Natural Language Processing, in the spring of 2010. While the class had traditionally focused on Prolog, that particular iteration focused instead on Python and the Natural Language Toolkit (NLTK).

For my final project in the class, I decided to implement the grammar inducer described in Glenn Carroll and Eugene Charniak’s 1992 paper, *Two Experiments on Learning Probabilistic Dependency Grammars from Corpora*. Though the paper was already fairly dated, and the inducer it describes was not particularly effective, I felt that implementing it in Python with NLTK would be an interesting exercise and a good demonstration of the skills I learned in the class.

Because there was, at the time, no good Python implementation of the Inside-Outside algorithm that I was aware of, and implementing one myself would have taken too long, I used Mark Johnson’s C implementation, which I found on his free software page. A direct download link to the version I used can be found here, though an updated version is now available, which I have not tried. Since no official open source license is included with the software, and the page says nothing about redistribution, I have decided to play it safe and not include the software in this repository. To use my grammar inducer, you will need to download the Inside-Outside software from the above link, compile it, and place the resulting `io` executable in the project folder. (I have only compiled and run this software on Linux, and cannot confirm whether it works on other operating systems.)

The main program is `grammar_inducer.py`, which includes a command-line interface. Simply run `python grammar_inducer.py` and follow the instructions provided. The program must be run in a writable directory, and the `io` executable must be in the same directory.

The remaining Python scripts were used to generate the sample corpora and to produce the data described in section 6 of my paper (“A second attempt”). I have also included the relevant outputs. The files `original.lt` and `original.yld` are the grammar and yield files supplied to the Inside-Outside algorithm. The grammar contains all rules that match the supplied corpus, with no restrictions or pruning. The script `delete_rules.py` takes these files and generates a family of new grammar files, each consisting of the original grammar minus one rule. The file `data` lists every rule alongside the entropy of the grammar that results from deleting it; if the resulting grammar cannot parse the corpus, an entropy of 1000 is recorded.
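The sweep in delete_rules.py amounts to a leave-one-rule-out loop. A minimal sketch of the idea follows; the names are illustrative rather than the actual code, and `grammar_entropy` stands in for the step that really runs the `io` executable and reads back an entropy, which is not shown here:

```python
UNPARSEABLE = 1000  # sentinel entropy recorded when the pruned grammar cannot parse the corpus

def entropy_sweep(rules, grammar_entropy):
    """Leave-one-rule-out sweep (hypothetical sketch).

    `rules` is the full grammar as a list of rule strings; `grammar_entropy`
    is a callable that evaluates a pruned grammar and returns its entropy,
    or None if the pruned grammar fails to parse the corpus.
    Returns a list of (rule, entropy) pairs, one per deleted rule.
    """
    results = []
    for i, rule in enumerate(rules):
        pruned = rules[:i] + rules[i + 1:]   # grammar minus this one rule
        ent = grammar_entropy(pruned)
        results.append((rule, UNPARSEABLE if ent is None else ent))
    return results
```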

The script `outliers.py` takes this `data` file and extracts only those rules whose associated entropy differs from the mode entropy, then sorts and prints them. The file `outliers` is the output of this program.
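That extraction is essentially a mode filter. A rough sketch, assuming the data has already been parsed into (rule, entropy) pairs (the actual file parsing is omitted):

```python
from statistics import mode

def find_outliers(rule_entropies):
    """Return sorted (entropy, rule) pairs whose entropy differs from the mode.

    Hypothetical sketch: `rule_entropies` is a list of (rule, entropy) pairs.
    """
    m = mode(e for _, e in rule_entropies)
    return sorted((e, r) for r, e in rule_entropies if e != m)
```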

Finally, `compare_rules.py` compares the contents of the `outliers` file to the original grammar used to generate the corpus, and lists which original rules were not found among the outliers.
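At heart this comparison is a set difference; a hypothetical sketch, with rules represented as plain strings:

```python
def missing_original_rules(original_rules, outlier_rules):
    """List the original-grammar rules that did not appear among the outliers.

    Hypothetical sketch: both arguments are iterables of rule strings;
    order of the original grammar is preserved in the result.
    """
    outliers = set(outlier_rules)
    return [r for r in original_rules if r not in outliers]
```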