Data Wrangling

I decided to roll our own form and review management for the Network Music Festival despite how much work this was the last time I did it. I also wanted to stop using Google and to have more flexibility.

So we deployed Ninja Forms, which has high ratings and worked great on a test website.

But then we started getting some emails. Some forms would not submit. There were no error messages in the logs. I tried a lorem ipsum generator and it wasn’t the amount of text. It wasn’t the special symbols. It worked great on the test website!

So we switched to google docs.

So then, when the call ended, I had to merge a bunch of spreadhseets together and then break them up again by what sort of performance people submitted. This is tedious and there’s always a risk of clobbering somebody’s data.

Then everything must get an ID. Rather than write a macro, I did it by hand. Which, again, is risky.

Finally, that done, I need to generate documents or whatever for reviewers to look at. In this case, I did write a macro. But then I wanted to convert all the URLs into clickable links. So I learned a little bit about how the python API works with LibreOffice.

I generally think of myself as fairly intelligent, but I found the documentation extremely hard to navigate. It definitely would have been faster and easier to also do this by hand, but by then the time reading the documents was a sunk cost.

So here’s how I’ve done it. I did a merge with the spreadsheets and a template, which is also a convoluted process that tends to crash things, because why wouldn’t it? I generated a large number of .odf files, one for each submission.

In open office, I went to Tools → AutoCorrect Options... → Options tab →
☑ ☑ URL Recognition

I turned off everything else in autocorrect. I don’t want to mangle anybody’s spelling capitalisation, etc.

Then I wrote a Python script:


# import socket  # only needed on win32-OOo3.0.0
import uno

# get the uno component context from the PyUNO runtime
localContext = uno.getComponentContext()

# create the UnoUrlResolver
resolver = localContext.ServiceManager.createInstanceWithContext(
				"com.sun.star.bridge.UnoUrlResolver", localContext )

# connect to the running office
ctx = resolver.resolve( "uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext" )
smgr = ctx.ServiceManager

# get the central desktop object
desktop = smgr.createInstanceWithContext( "com.sun.star.frame.Desktop",ctx)

# access the current writer document
model = desktop.getCurrentComponent()

# get the dispatcher
dispatcher = smgr.createInstanceWithContext( "com.sun.star.frame.DispatchHelper", ctx)
doc = model.getCurrentController()

# run the commands
dispatcher.executeDispatch(doc, ".uno:AutoFormatApply", "", 0, ())
dispatcher.executeDispatch(doc, ".uno:Save",  "", 0, ())
dispatcher.executeDispatch(doc, ".uno:Quit", "", 0, ())

I found a lot of useful documentation for this at: https://christopher5106.github.io/office/2015/12/06/openoffice-libreoffice-automate-your-office-tasks-with-python-macros.html

Now, obviously, that connects to a running version of libreoffice which has the document you want already open. Obviously. There’s a command line for libreoffice that does a few things, including opening a port to listen for python commands. It also does file conversions.

Before I go, I /know/ I could have done a file conversion in the python script, but sometimes, when evaluating time save shortcuts, it’s important to make a note of how long it would take to figure out the shortcut. None of the commands called in my python script take arguments. I’m sure passing arguments is straightforward and also not documented anywhere on the official site. (If this is wrong, I’m sorry, but yikes, you guys.)

So I wrote a bash script that traverses a directory and also calls the python script

#!/bin/bash

for file in *.odt
do
    soffice $file --accept="socket,host=localhost,port=2002;urp;StarOffice.ServiceManager" --headless &
    pid=$!
    sleep 3
    python3 ../urlify.py
    wait $pid
    soffice --headless --convert-to htm:HTML $file
    rm $file
done

That sleep is in there because you have to wait for writer to get going and allocate a port or else the python script hangs and everything goes to hell.

The next step is to allocate 2 reviewers (one from column A and one from column B) to every item, making sure that every reviewer gets no more than 5 items, mark this in a spreadsheet and email them off.

I expect this will also take about 5 times as long as doing it by hand.

Published by

Charles Céleste Hutchins

Leave a Reply Cancel reply