{"id":5095,"date":"2020-05-16T18:53:28","date_gmt":"2020-05-16T17:53:28","guid":{"rendered":"http:\/\/www.celesteh.com\/blog\/?p=5095"},"modified":"2020-05-16T18:53:31","modified_gmt":"2020-05-16T17:53:31","slug":"data-wrangling","status":"publish","type":"post","link":"https:\/\/www.celesteh.com\/blog\/2020\/05\/16\/data-wrangling\/","title":{"rendered":"Data Wrangling"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">I decided to roll our own form and review management for the Network Music Festival despite how much work this was the last time I did it.  I also wanted to stop using Google and to have more flexibility.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">So we deployed Ninja Forms, which has high ratings and worked great on a test website.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">But then we started getting some emails. Some forms would not submit. There were no error messages in the logs. I tried a lorem ipsum generator and it wasn&#8217;t the amount of text. It wasn&#8217;t the special symbols. It worked great on the test website!<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">So we switched to google docs.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">So then, when the call ended, I had to merge a bunch of spreadhseets together and then break them up again by what sort of performance people submitted. This is tedious and there&#8217;s always a risk of clobbering somebody&#8217;s data.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Then everything must get an ID. Rather than write a macro, I did it by hand. Which, again, is risky.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, that done, I need to generate documents or whatever for reviewers to look at.  In this case, I did write a macro. But then I wanted to convert all the URLs into clickable links. So I learned a little bit about how the python API works with LibreOffice.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I generally think of myself as fairly intelligent, but I found the documentation extremely hard to navigate. It definitely would have been faster and easier to also do this by hand, but by then the time reading the documents was a sunk cost.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">So here&#8217;s how I&#8217;ve done it.  I did a merge with the spreadsheets and a template, which is also a convoluted process that tends to crash things, because why wouldn&#8217;t it?  I generated a large number of .odf files, one for each submission.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In open office, I went to <code>Tools \u2192 AutoCorrect Options... \u2192 Options tab \u2192<\/code><br><code>\u2611 \u2611 URL Recognition<\/code><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I turned off everything else in autocorrect. I don&#8217;t want to mangle anybody&#8217;s spelling capitalisation, etc.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Then I wrote a Python script:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>\n# import socket  # only needed on win32-OOo3.0.0\nimport uno\n\n# get the uno component context from the PyUNO runtime\nlocalContext = uno.getComponentContext()\n\n# create the UnoUrlResolver\nresolver = localContext.ServiceManager.createInstanceWithContext(\n\t\t\t\t\"com.sun.star.bridge.UnoUrlResolver\", localContext )\n\n# connect to the running office\nctx = resolver.resolve( \"uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext\" )\nsmgr = ctx.ServiceManager\n\n# get the central desktop object\ndesktop = smgr.createInstanceWithContext( \"com.sun.star.frame.Desktop\",ctx)\n\n# access the current writer document\nmodel = desktop.getCurrentComponent()\n\n# get the dispatcher\ndispatcher = smgr.createInstanceWithContext( \"com.sun.star.frame.DispatchHelper\", ctx)\ndoc = model.getCurrentController()\n\n# run the commands\ndispatcher.executeDispatch(doc, \".uno:AutoFormatApply\", \"\", 0, ())\ndispatcher.executeDispatch(doc, \".uno:Save\",  \"\", 0, ())\ndispatcher.executeDispatch(doc, \".uno:Quit\", \"\", 0, ())\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">I found a lot of useful documentation for this at: <a href=\"https:\/\/christopher5106.github.io\/office\/2015\/12\/06\/openoffice-libreoffice-automate-your-office-tasks-with-python-macros.html \">https:\/\/christopher5106.github.io\/office\/2015\/12\/06\/openoffice-libreoffice-automate-your-office-tasks-with-python-macros.html <\/a> <\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Now, obviously, that connects to a running version of libreoffice which has the document you want already open. Obviously. There&#8217;s a command line for libreoffice that does a few things, including opening a port to listen for python commands. It also does file conversions.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Before I go, I \/know\/ I could have done a file conversion in the python script, but sometimes, when evaluating time save shortcuts, it&#8217;s important to make a note of how long it would take to figure out the shortcut. None of the commands called in my python script take arguments. I&#8217;m sure passing arguments is straightforward and also not documented anywhere on the official site. (If this is wrong, I&#8217;m sorry, but yikes, you guys.)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">So I wrote a bash script that traverses a directory and also calls the python script<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>#!\/bin\/bash\n\nfor file in *.odt\ndo\n    soffice $file --accept=\"socket,host=localhost,port=2002;urp;StarOffice.ServiceManager\" --headless &amp;\n    pid=$!\n    sleep 3\n    python3 ..\/urlify.py\n    wait $pid\n    soffice --headless --convert-to htm:HTML $file\n    rm $file\ndone<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">That sleep is in there because you have to wait for writer to get going and allocate a port or else the python script hangs and everything goes to hell.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The next step is to allocate 2 reviewers (one from column A and one from column B) to every item, making sure that every reviewer gets no more than 5 items, mark this in a spreadsheet and email them off.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I expect this will also take about 5 times as long as doing it by hand.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I decided to roll our own form and review management for the Network Music Festival despite how much work this was the last time I did it. I also wanted to stop using Google and to have more flexibility. So we deployed Ninja Forms, which has high ratings and worked great on a test website. &hellip; <a href=\"https:\/\/www.celesteh.com\/blog\/2020\/05\/16\/data-wrangling\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Data Wrangling<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"activitypub_content_warning":"","activitypub_content_visibility":"","activitypub_max_image_attachments":4,"activitypub_interaction_policy_quote":"anyone","activitypub_status":"","footnotes":""},"categories":[1],"tags":[67],"class_list":["post-5095","post","type-post","status-publish","format-standard","hentry","category-uncategorised","tag-python"],"_links":{"self":[{"href":"https:\/\/www.celesteh.com\/blog\/wp-json\/wp\/v2\/posts\/5095","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.celesteh.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.celesteh.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.celesteh.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.celesteh.com\/blog\/wp-json\/wp\/v2\/comments?post=5095"}],"version-history":[{"count":2,"href":"https:\/\/www.celesteh.com\/blog\/wp-json\/wp\/v2\/posts\/5095\/revisions"}],"predecessor-version":[{"id":5097,"href":"https:\/\/www.celesteh.com\/blog\/wp-json\/wp\/v2\/posts\/5095\/revisions\/5097"}],"wp:attachment":[{"href":"https:\/\/www.celesteh.com\/blog\/wp-json\/wp\/v2\/media?parent=5095"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.celesteh.com\/blog\/wp-json\/wp\/v2\/categories?post=5095"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.celesteh.com\/blog\/wp-json\/wp\/v2\/tags?post=5095"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}