November 22, 2016
Modern NLP pipelines use large models that need to be distributed across the whole processing infrastructure. For example, in the SSIX project we manage models of several GBs for the financial sector. At that scale you can’t assume the models will be transferred at task submission time, nor copied around manually. In our research we didn’t find any well-accepted approach to this problem. For example, TensorFlow simply uses a git repository, while other projects such as OpenNLP are starting to use Maven for that. With proper versioning and distribution in place, we could better focus on testing and benchmarking our models.
Moven (models + Maven) is our proof of concept for addressing that need: it relies on the Maven infrastructure to publish machine learning and deep learning models. It is a result of the cooperation in the SSIX project between the University of Passau (André Freitas, Leonardo Souza) and Redlink (Rupert Westenthaler and Sergio Fernández).
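To illustrate what relying on the Maven infrastructure can look like in practice, here is a minimal Python sketch of how Maven coordinates map to a download URL under the standard Maven repository layout. This is not Moven's actual API: the function name artifact_url, the coordinates com.example.models:sentiment-model:1.0.0 and the zip packaging are all hypothetical, chosen only for illustration.

```python
from typing import Optional

# Base URL of Maven Central; any repository following the standard layout works.
MAVEN_CENTRAL = "https://repo1.maven.org/maven2"


def artifact_url(group_id: str, artifact_id: str, version: str,
                 extension: str = "jar", classifier: Optional[str] = None,
                 repository: str = MAVEN_CENTRAL) -> str:
    """Build the URL of an artifact using the standard Maven repository layout.

    Layout: <repo>/<group with dots as slashes>/<artifact>/<version>/
            <artifact>-<version>[-<classifier>].<extension>
    """
    group_path = group_id.replace(".", "/")
    filename = f"{artifact_id}-{version}"
    if classifier:
        filename += f"-{classifier}"
    filename += f".{extension}"
    return "/".join([repository.rstrip("/"), group_path,
                     artifact_id, version, filename])


# Hypothetical model coordinates, used only to show the mapping.
url = artifact_url("com.example.models", "sentiment-model", "1.0.0",
                   extension="zip")
print(url)
```

Because a model published this way is addressed by group, artifact and version, every worker in a cluster can resolve exactly the same model release, which is the versioning and distribution property described above.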
Today we are happy to announce the release of Moven 0.1.0: Java and Python artifacts can be downloaded from Maven Central and PyPI respectively, so the current implementation can be used from both Java and Python. Beyond that, we’re targeting the more specific needs of some concrete environments, such as Apache Spark or the Apache Beam Runners API.
Apache technologies play an important role in the SSIX core stack, which includes projects such as Beam, Spark, Kafka and ZooKeeper, among many others. Therefore, all the new developments presented at the conference could have a big influence on the development of the SSIX project during this second year.
This blog post was written by SSIX partner Sergio Fernández, Software Engineer at Redlink GmbH and Member of The Apache Software Foundation, who leads some of the Big Data developments for analysis processing in the SSIX project.