Natural language processing made easier, with pipes
At this year’s OpenNews Code Convening, Alex Spangher of the New York Times and I worked on broca, which is a Python library for rapidly experimenting with new natural language processing (NLP) approaches.
Conventional NLP methods—bag-of-words or vector space representations of documents, for example—generally work well, but sometimes not well enough, or worse yet, not well at all. At that point, you might want to try out a lot of different methods that aren’t available in popular NLP libraries.
Prior to the code convening, broca was little more than a hodgepodge of algorithms I’d implemented for various projects. During the convening, we restructured the library, added some examples and tests, and implemented the key piece of broca: pipelines.
The core of broca is organized around pipes, each of which takes some input and produces some output; pipes are then chained into pipelines.
Pipes represent different stages of an NLP process—for instance, your first stage may involve preprocessing or cleaning up the document, the next may be vectorizing it, and so on.
In broca, this would look like:
```python
from broca.pipeline import Pipeline
from broca.preprocess import Cleaner
from broca.vectorize import BoW

docs = [
    # ...
    # some string documents
    # ...
]

pipeline = Pipeline(
    Cleaner(),
    BoW()
)

vectors = pipeline(docs)
```
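Conceptually, a pipeline is just function composition: each pipe’s output becomes the next pipe’s input. A minimal, self-contained sketch of the idea (the pipe classes here are hypothetical stand-ins, not broca’s actual implementation):

```python
class Pipe:
    """A pipe transforms its input into some output."""
    def __call__(self, data):
        raise NotImplementedError

class Lowercase(Pipe):
    # Hypothetical preprocessing pipe: lowercase each document.
    def __call__(self, docs):
        return [d.lower() for d in docs]

class Tokenize(Pipe):
    # Hypothetical pipe: split each document into tokens.
    def __call__(self, docs):
        return [d.split() for d in docs]

class Pipeline:
    """Chain pipes: feed each pipe's output to the next pipe."""
    def __init__(self, *pipes):
        self.pipes = pipes

    def __call__(self, data):
        for pipe in self.pipes:
            data = pipe(data)
        return data

pipeline = Pipeline(Lowercase(), Tokenize())
print(pipeline(["Hello World", "Pipes are Neat"]))
# → [['hello', 'world'], ['pipes', 'are', 'neat']]
```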
Since a key goal of broca is rapid prototyping, it makes it very easy to simultaneously try different pipelines that vary in only a few components:
```python
from broca.vectorize import DCS

pipeline = Pipeline(
    Cleaner(),
    [BoW(), DCS()]
)
```
This would produce a multi-pipeline consisting of two pipelines: one which vectorizes using BoW, the other using DCS.
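One way to think about the expansion: wherever a list of alternatives appears, the pipeline branches into one variant per alternative, i.e. a cartesian product across all branching points. A rough, illustrative sketch of that expansion (not broca’s actual code; pipe names are placeholder strings):

```python
from itertools import product

def expand(stages):
    """Expand stages with alternatives (lists) into all pipeline variants."""
    # Wrap single pipes so every stage is a list of alternatives.
    alternatives = [s if isinstance(s, list) else [s] for s in stages]
    # The cartesian product yields one concrete pipeline per combination.
    return [list(variant) for variant in product(*alternatives)]

# Two variants: Cleaner→BoW and Cleaner→DCS.
print(expand(['Cleaner', ['BoW', 'DCS']]))
# → [['Cleaner', 'BoW'], ['Cleaner', 'DCS']]
```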
Multi-pipelines often have shared components. In the example above, Cleaner() is in both pipelines. To avoid redundant processing, a key part of broca’s pipelines is that the output of each pipe is “frozen” to disk.
These frozen outputs are identified by a hash derived from the input data and other factors. If frozen output exists for a pipe and its input, that frozen output is “defrosted” and returned, saving unnecessary processing time.
This way, you can tweak different components of the pipeline without worrying about needing to re-compute a lot of data. Only the parts that have changed will be re-computed.
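The freezing mechanism is essentially disk memoization keyed by a content hash. A simplified sketch of the idea (the cache layout, function names, and hashing details here are illustrative assumptions, not broca’s exact scheme):

```python
import hashlib
import os
import pickle
import tempfile

CACHE_DIR = tempfile.mkdtemp()  # broca would use a persistent directory

def freeze_key(pipe_name, data):
    # Hash the pipe's identity together with its input, so any change
    # to either produces a different cache entry.
    h = hashlib.sha1()
    h.update(pipe_name.encode())
    h.update(pickle.dumps(data))
    return h.hexdigest()

def run_frozen(pipe_name, func, data):
    """Return cached ("frozen") output if present; otherwise compute and freeze it."""
    path = os.path.join(CACHE_DIR, freeze_key(pipe_name, data))
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)  # defrost
    output = func(data)
    with open(path, 'wb') as f:
        pickle.dump(output, f)  # freeze
    return output

docs = ['Some Document', 'Another Document']
clean = run_frozen('cleaner', lambda ds: [d.lower() for d in ds], docs)
# A second call with the same pipe and input skips computation entirely
# and defrosts the result from disk.
clean_again = run_frozen('cleaner', lambda ds: [d.lower() for d in ds], docs)
```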
broca includes a few pipes:
broca.tokenize includes various tokenization methods, using lemmas and a few different keyword extractors.
broca.vectorize includes a traditional bag-of-words vectorizer, an implementation of “disambiguated core semantics,” and Doc2Vec.
broca.preprocess includes common preprocessors: cleaning punctuation, HTML, and a few others.
Not everything in broca is a pipe. Also included are:
broca.similarity includes similarity methods for terms and documents.
broca.distance includes string distance methods (this may be renamed later).
broca.knowledge includes some tools for dealing with external knowledge sources (e.g. other corpora or Wikipedia).
At some point, though, these may also become pipes.
Give Us Your Pipes!
We made it really easy to implement your own pipes. Just inherit from the Pipe class, specify the class’s input and output types, and implement the __call__ method (that’s what’s called for each pipe).
```python
from broca.pipeline import Pipe

class MyPipe(Pipe):
    input = Pipe.type.docs
    output = Pipe.type.vecs

    def __init__(self, some_param):
        self.some_param = some_param

    def __call__(self, docs):
        # do something with docs to get vectors
        vecs = make_vecs_func(docs, self.some_param)
        return vecs
```
We hope that others will implement their own pipes and submit them as pull requests—it would be great if broca becomes a repository of sundry NLP methods which make it super easy to quickly try a battery of techniques on a problem.
broca is available on GitHub and also via
pip install broca
Francis Tseng is a designer, data developer, and past Knight-Mozilla OpenNews fellow interested in how automation, simulation, and machine learning relate to social and political issues. He is currently working on the Coral Project’s data analysis systems.