org.elasticsearch.transport.ConnectTransportException or the case of the yellow cluster

Some time ago we needed to add two data nodes to our Elasticsearch cluster, which we happily ordered from our cloud provider. The first one joined OK and shards started moving around nicely. A happy green cluster. Upon adding the second node, however, the cluster started accepting shards but remained in yellow state. Consistently. For hours. Even trying to empty the node in order to remove it did not work; some shards would stay there forever.

Upon looking at the node logs, here is what caught our attention:

org.elasticsearch.transport.ConnectTransportException: [datanode-7][] connect_exception

A similar log entry was found in datanode-7's log file. What was going on here? These two machines had been assigned sequential IP addresses, and they could literally ping the whole internet but not find each other. The cloud provider's support group replied:

in this case you need to configure a hostroute via the gw as our switch doesn't allow a direct communication.

Enter systemd territory then. Not wanting to make this yet another service, I defaulted to the oldest boot solution: editing /etc/rc.local (in reality /etc/rc.d/rc.local) on both machines with the appropriate routes:

ip route add via dev enp0s31f6

and enable it:

# chmod +x /etc/rc.d/rc.local
# systemctl enable rc-local
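
For the record, the resulting rc.local could look something like this (the addresses below are hypothetical placeholders; use the peer node's address and the gateway your provider gives you):

```shell
#!/bin/sh
# Host route to the peer data node via the gateway, because the
# switch does not allow direct communication between the two.
# 192.0.2.11 = peer node, 192.0.2.1 = gateway (hypothetical values)
ip route add 192.0.2.11 via 192.0.2.1 dev enp0s31f6
exit 0
```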

rc.local will never die. It is that loyal friend that will wait to be called upon when you need them most.

resizing a vagrant box disk

[ I am about to do what others have done before me and blog about it one more time ]

While I do enjoy working with Windows 10, I am still not using WSL (waiting for WSL2) and work with either chocolatey or a vagrant Ubuntu box. It so happens that after pulling a few docker images the default 10G disk is full and you cannot work anymore. So, let's resize the disk.

The disk on my ubuntu/bionic64 box is a VMDK one. So before resizing, we need to convert it to a VDI first, which is easier for VirtualBox to handle:

VBoxManage clonehd .\ubuntu-bionic-18.04-cloudimg.vmdk .\ubuntu-bionic-18.04-cloudimg.vdi --format vdi

Now we can resize it, to say 20G:

VBoxManage modifymedium disk .\ubuntu-bionic-18.04-cloudimg.vdi --resize 20000

We're almost there: we need to tell vagrant to boot from the VDI disk now. To do so, open VirtualBox and visit the storage settings of the vagrant VM. Remove the VMDK disk(s) there and add the VDI on the SCSI0 port. That's it, we're one step closer. Close VirtualBox and vagrant up to boot from the VDI.

Now you have a 20G disk, but still a 10G partition. parted to the rescue:

$ sudo parted /dev/sda
(parted) resizepart 

It will ask you for the partition number. You answer 1 (which is /dev/sda1). It will ask you for the end of the partition. You answer -1 (which means up to the end of the disk). quit and you're out.

You have changed the partition size, but the filesystem still reports the old size. resize2fs (assuming a compatible filesystem) to the rescue:

$ sudo resize2fs /dev/sda1

Now you’re done. You may want to vagrant reload to check whether everything works fine. Once you’re sure of that you can delete the old VMDK disk.

PORT is deprecated. Please use SCHEMA_REGISTRY_LISTENERS instead.

I was trying to launch a schema-registry within a kubernetes cluster, and every time I wanted to expose the pod's port through a service, I was greeted by the message in the title:

if [[ -n "${SCHEMA_REGISTRY_PORT-}" ]]
then
  echo "PORT is deprecated. Please use SCHEMA_REGISTRY_LISTENERS instead."
  exit 1
fi

This happened because I had also named my service schema-registry (which was kind of non-negotiable at the time), and kubernetes happily sets the SCHEMA_REGISTRY_PORT environment variable to the value of the port you want to expose. It turns out that this very variable name has special meaning within the container.
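
To make the clash concrete: for a Service named schema-registry exposing, say, port 8081, Kubernetes injects docker-link-style variables into the pods of the namespace, roughly like this (the cluster IP is a hypothetical placeholder):

```shell
SCHEMA_REGISTRY_SERVICE_HOST=10.96.0.10
SCHEMA_REGISTRY_SERVICE_PORT=8081
SCHEMA_REGISTRY_PORT=tcp://10.96.0.10:8081
SCHEMA_REGISTRY_PORT_8081_TCP=tcp://10.96.0.10:8081
```

and the container's startup script interprets that SCHEMA_REGISTRY_PORT as its own configuration knob.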

Fortunately, I was not the only one bitten by this error, albeit with a different variable name, and I resorted to the same ugly hack:

$ kubectl -n kafka-tests get deployment schema-registry -o yaml
      - command:
        - bash
        - -c
        - unset SCHEMA_REGISTRY_PORT; /etc/confluent/docker/run

How I run a small (discussion) mailing list

I used to run a fairly large (for two Ops people) mail system until 2014. I did it all: sendmail, MIMEDefang, my own milters, SPF, and a bunch of other accompanying acronyms and technologies. I've stopped doing that now. The big companies have both the money and the workforce to do it better. But I still run a small mailing list. It used to run on majordomo, but then it started having issues with modern Perl and I moved it to Mailman. Which is kind of an overkill for just a list with 300 or so people on it. So for a time I even ran it manually using free GSuite (yes, there was a time it was free) hosting.

Lately I wanted to really run it automated again. So I thought I should try something different that would also contribute to its general civility and high SNR. I'd been lurking around the picolisp mailing list for years and thought I should use picolisp, because it comes with a mailing list program:

#!/usr/bin/picolisp /usr/lib/picolisp/lib.l
# 19apr17abu
# (c) Software Lab. Alexander Burger

# Configuration
(setq
   *MailingList ""
   *SpoolFile "/var/mail/foobar"
   *MailingDomain ""
   *Mailings (make (in "/home/foobar/Mailings" (while (line T) (link @))))
   *SmtpHost "localhost"
   *SmtpPort 25 )

# Process mails
(loop
   (when (gt0 (car (info *SpoolFile)))
         (in *SpoolFile
            (unless (= "From" (till " " T))
               (quit "Bad mbox file") )
            (while (setq *From (lowc (till " " T)))
               (line)  # Skip rest of line and "\r\n"
            (off *Name *Subject *Date *MessageID *InReplyTo *MimeVersion
                  *ContentType *ContentTransferEncoding *ContentDisposition *UserAgent )
               (while (trim (split (line) " "))
                  (let L @
                     (while (and (sub? (peek) " \t") (char))  # Skip WSP
                        (conc L (trim (split (line) " "))) )
                     (setq *Line (glue " " (cdr L)))
                     (case (pack (car L))
                        ("From:" (setq *Name *Line))
                        ("Subject:" (setq *Subject *Line))
                        ("Date:" (setq *Date *Line))
                        ("Message-ID:" (setq *MessageID *Line))
                        ("In-Reply-To:" (setq *InReplyTo *Line))
                        ("MIME-Version:" (setq *MimeVersion *Line))
                        ("Content-Type:" (setq *ContentType *Line))
                        ("Content-Transfer-Encoding:" (setq *ContentTransferEncoding *Line))
                        ("Content-Disposition:" (setq *ContentDisposition *Line))
                        ("User-Agent:" (setq *UserAgent *Line)) ) ) )
               (if (nor (member *From *Mailings) (= "subscribe" (lowc *Subject)))
                  (out "/dev/null" (echo "^JFrom ") (msg *From " discarded"))
                  (unless (setq *Sock (connect *SmtpHost *SmtpPort))
                     (quit "Can't connect to SMTP server") )
                  (unless (and
                        (pre? "220 " (in *Sock (line T)))
                        (out *Sock (prinl "HELO " *MailingDomain "^M"))
                        (pre? "250 " (in *Sock (line T)))
                        (out *Sock (prinl "MAIL FROM:" *MailingList "^M"))
                        (pre? "250 " (in *Sock (line T))) )
                     (quit "Can't HELO") )
                  (when (= "subscribe" (lowc *Subject))
                     (push1 '*Mailings *From)
                     (out "Mailings" (mapc prinl *Mailings)) )
                  (for To *Mailings
                     (out *Sock (prinl "RCPT TO:" To "^M"))
                     (unless (pre? "250 " (in *Sock (line T)))
                        (msg T " can't mail") ) )
                  (when (and (out *Sock (prinl "DATA^M")) (pre? "354 " (in *Sock (line T))))
                     (out *Sock
                        (prinl "From: " (or *Name *From) "^M")
                        (prinl "Sender: " *MailingList "^M")
                        (prinl "Reply-To: " *MailingList "^M")
                        (prinl "To: " *MailingList "^M")
                        (prinl "Subject: " *Subject "^M")
                        (and *Date (prinl "Date: " @ "^M"))
                        (and *MessageID (prinl "Message-ID: " @ "^M"))
                        (and *InReplyTo (prinl "In-Reply-To: " @ "^M"))
                        (and *MimeVersion (prinl "MIME-Version: " @ "^M"))
                        (and *ContentType (prinl "Content-Type: " @ "^M"))
                        (and *ContentTransferEncoding (prinl "Content-Transfer-Encoding: " @ "^M"))
                        (and *ContentDisposition (prinl "Content-Disposition: " @ "^M"))
                        (and *UserAgent (prinl "User-Agent: " @ "^M"))
                        (prinl "^M")
                        (cond
                           ((= "subscribe" (lowc *Subject))
                              (prinl "Hello " (or *Name *From) " :-)^M")
                              (prinl "You are now subscribed^M")
                              (prinl "****^M^J^M") )
                           ((= "unsubscribe" (lowc *Subject))
                              (out "Mailings"
                                 (mapc prinl (del *From '*Mailings)) )
                              (prinl "Good bye " (or *Name *From) " :-(^M")
                              (prinl "You are now unsubscribed^M")
                              (prinl "****^M^J^M") ) )
                        (echo "^JFrom ")
                        (prinl "^J-- ^M")
                        (prinl "UNSUBSCRIBE: mailto:" *MailingList "?subject=Unsubscribe^M")
                        (prinl ".^M")
                        (prinl "QUIT^M") ) )
                  (close *Sock) ) ) )
      (out *SpoolFile (rewind)) )
   (call "fetchmail" "-as")
   (wait `(* 4 60 1000)) )

# vi:et:ts=3:sw=3

You do not have to understand a lot of Lisp to see how this is configured:
– *MailingList is your mailing list's address.
– *SpoolFile is where incoming mail for the mailing list is saved. If you have a mail server and run the list there, just run it as a plain user and point this to the user's mailbox under /var/mail. Or, if you run fetchmail, point it to the file where fetchmail saves incoming mail; the (call "fetchmail" "-as") line near the end of the script is what fetches it.
– Related to the above: since I run this on my mail server, I delete that fetchmail line.
– *MailingDomain is my machine's HELO/EHLO name.
– *Mailings is the file where the list membership is saved.
– *SmtpHost is the outgoing mail server.
– *SmtpPort is the outgoing mail server's SMTP listening port.

Since my own incoming mail server is on a cloud provider and does not enjoy good sending reputation, I am making use of mailgun as a forwarding service. My mail server forwards to mailgun and mailgun delivers to the recipients. If you need to run your own small mail server there are tons of tutorials out there. While I am a die-hard sendmail person, I am running Postfix these days.

The final issue that remains is how to run the mailing list processor. mailing is a console program and you need to run it as a daemon somehow. Docker to the rescue. I am building an image with the following Dockerfile:

FROM debian:buster
RUN apt-get update && apt-get install -y picolisp
WORKDIR /usr/src
COPY ./mailing .
CMD ./mailing

and I am executing this with:

docker run -d --name=foobar --restart=unless-stopped -v /var/mail/foobar:/var/mail/foobar -v /home/foobar/Mailings:/home/foobar/Mailings foobar_image

I can even inspect the logs with docker logs foobar and have made it resilient through reboots with one command.

There. I hope this gives you some ideas on how to run your own (small) mailing list. With some tinkering, you do not even need to run your own incoming mail server: it can be a mailbox in Gmail or elsewhere; you fetch mail locally with fetchmail, submit through a relay service, and you're done. Also note: relay services require payment after some threshold.

ansible, timezone and JDK8

While one might think that they can change the timezone of a machine with ansible like this:

  - name: set /etc/localtime
    timezone:
      name: UTC

with some JDK apps it is not enough, because the JVM looks at /etc/timezone. So you need to update that file too, perhaps in a cleaner way than:

  - name: set /etc/timezone
    shell: echo UTC > /etc/timezone
    args:
      creates: /etc/timezone
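
A sketch of such a cleaner way, using the copy module (task name mine), which also fixes the file if it already exists with a different value:

```yaml
- name: set /etc/timezone
  copy:
    content: "UTC\n"
    dest: /etc/timezone
    owner: root
    group: root
    mode: "0644"
```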

Kafka, dotnet and SASL_SSL

This is similar to my previous post, only now the question is, how do you connect to a Kafka server using dotnet and SASL_SSL? This is how:

// based on

using Confluent.Kafka;
using System;
using System.IO;
using System.Text;
using System.Threading.Tasks;
using System.Collections.Generic;

namespace Confluent.Kafka.Examples.ProducerExample
{
    public class Program
    {
        public static async Task Main(string[] args)
        {
            string topicName = "test-topic";

            var config = new ProducerConfig {
                BootstrapServers = "",
                SecurityProtocol = SecurityProtocol.SaslSsl,
                SslCaLocation = "ca-cert",
                SaslMechanism = SaslMechanism.Plain,
                SaslUsername = "USERNAME",
                SaslPassword = "PASSWORD",
                Acks = Acks.Leader,
                CompressionType = CompressionType.Lz4,
            };

            using (var producer = new ProducerBuilder<string, string>(config).Build())
            {
                for (int i = 0; i < 1000000; i++)
                {
                    var message = $"Event {i}";

                    try
                    {
                        // Note: Awaiting the asynchronous produce request
                        // below prevents flow of execution from proceeding
                        // until the acknowledgement from the broker is
                        // received (at the expense of low throughput).
                        var deliveryReport = await producer.ProduceAsync(topicName, new Message<string, string> { Key = null, Value = message } );
                        // Console.WriteLine($"delivered to: {deliveryReport.TopicPartitionOffset}");

                        // Let's not await then
                        // producer.ProduceAsync(topicName, new Message<string, string> { Key = null, Value = message } );
                        // Console.WriteLine($"Event {i} sent.");
                    }
                    catch (ProduceException<string, string> e)
                    {
                        Console.WriteLine($"failed to deliver message: {e.Message} [{e.Error.Code}]");
                    }
                }

                // producer.Flush(TimeSpan.FromSeconds(120));

                // Since we are producing synchronously, at this point there will be no messages
                // in-flight and no delivery reports waiting to be acknowledged, so there is no
                // need to call producer.Flush before disposing the producer.
            }
        }
    }
}

Since I am a total .NET newbie, I usually docker run -it --rm microsoft/dotnet and experiment from there.

Kafka, PHP and SASL_SSL

When you want to connect to a Kafka cluster from PHP there are numerous examples showing how to use php-rdkafka, but unauthenticated. But what happens when you need to let a customer connect to a Kafka setup and IP whitelisting is not enough? Not much easily locatable information is out there.

Why not correct this by combing through various web pages and the librdkafka source code:


<?php

$conf = new RdKafka\Conf();
$conf->set('security.protocol', 'SASL_SSL');
$conf->set('sasl.mechanisms', 'PLAIN');
$conf->set('sasl.username', 'USERNAME_HERE');
$conf->set('sasl.password', 'PASSWORD_HERE');
$conf->set('ssl.ca.location', '/usr/local/etc/ca-cert.pem');
$conf->set('ssl.cipher.suites', 'TLSv1.2');

$rk = new RdKafka\Producer($conf);

$topic = $rk->newTopic("kafka-test-topic");

for ($i = 0; $i < 10; $i++) {
    $topic->produce(RD_KAFKA_PARTITION_UA, 0, "Message $i");
    $rk->poll(0);
}

while ($rk->getOutQLen() > 0) {
    $rk->poll(50);
}

Still, this may not be enough if your Kafka server is on OpenSSL 1.0.2 (CentOS 7, for example) and your PHP client is on OpenSSL 1.1.0 (like the php:7.2-cli docker image). In such a case you need to edit your client's openssl.cnf and comment out the following line:

;CipherString = DEFAULT@SECLEVEL=2
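
One way to apply that change when building the client image (a sketch; on Debian-based images such as php:7.2-cli the file lives at /etc/ssl/openssl.cnf):

```shell
# Comment out the SECLEVEL=2 default so OpenSSL 1.1 will still
# negotiate with peers built against older OpenSSL versions.
sed -i 's/^CipherString = DEFAULT@SECLEVEL=2/;CipherString = DEFAULT@SECLEVEL=2/' /etc/ssl/openssl.cnf
```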

Wasting time with gawk while parsing lsof output

So I wanted to parse lsof output, to see on what ports a machine was accepting connections. Normally one would write something like:

# lsof -Pn -i | grep LISTEN | awk '{print $9}' | cut -d: -f2 | sort -n | uniq

You get a sorted list of the open ports and are done with it. But why invoke four different programs to do extraction and sorting, when gawk is a complete programming language? Yes it is possible to do it with gawk in one go (and learn something in the process):

# lsof -Pn -i | awk '/LISTEN/ { split($9, a, ":"); b[a[2]] = 1; } END { n = asorti(b, c, "@ind_num_asc"); for (i = 1; i <= n; i++) { print c[i]; } }'

The /LISTEN/ effectively greps the lsof output for lines containing LISTEN and executes on them the code in the curly braces to its right, which splits the 9th column into an array using : as a delimiter. In awk, split() numbers the array elements from 1, and array indices are in fact strings (make a note of that).

END is a special pattern whose action runs after we've finished reading the input data, so this is where the printing is done. The asorti() function gives us a new array, c, whose values are the indices of b in sorted order. We use @ind_num_asc to ensure that the order is 1, 5, 10, 15 and not 1, 10, 15, 5, as it would be if the indices were compared as strings. Finally, we print the elements of the new array.

This would not be easily possible with awk / nawk, because as the gawk manual says:

In most awk implementations, sorting an array requires writing a sort() function. This can be educational for exploring different sorting algorithms, but usually that’s not the point of the program. gawk provides the built-in asort() and asorti() functions.
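
So with a plain awk or nawk, a reasonable fallback is to keep only the extraction in awk and delegate the ordering to sort(1), something like:

```shell
# Extract the port part of column 9 for listening sockets and
# let sort -nu do the numeric ordering and de-duplication.
lsof -Pn -i | awk '/LISTEN/ { split($9, a, ":"); print a[2] }' | sort -nu
```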

Somehow this reminds me of Knuth vs McIlroy, but of course I am neither.


Years ago, while browsing the original wiki site, I stumbled upon the fizzbuzz test:

“Write a program that prints the numbers from 1 to 100. But for multiples of three print “Fizz” instead of the number and for the multiples of five print “Buzz”. For numbers which are multiples of both three and five print “FizzBuzz”.”

Imagine my surprise when reading that it was designed to make most programming candidates fail the test. From time to time I have the fine opportunity to introduce people who have never coded before in their lives to Python. Within the first two hours they have managed to produce a fizzbuzz version that looks like this:

for i in range(1, 101):
  if i % 3 == 0 and i % 5 == 0:
    print('fizzbuzz')
  elif i % 3 == 0:
    print('fizz')
  elif i % 5 == 0:
    print('buzz')
  else:
    print(i)

I like the myth that 99.5% of candidates fail fizzbuzz. I tell students that they can now celebrate their achievement. But this is not where I stop with the test. I can now tell them about functions and have them make one like:

def fizzbuzz(start, stop):

where I have them modify their code above to make it a function. And afterwards they can learn about named parameters with default values:

def fizzbuzz(start=1, stop=100, three=3, five=5):

Notice above how one can change the values of 3 and 5 from the original test and try a different pair. And yes, I leave the functions as exercises to the reader :)

But the best sparks in their eyes come when they remember that I had taught them certain properties of strings some three hours ago, like:

>>> 'fizz' + 'buzz'
'fizzbuzz'
>>> 'fizz' * 1
'fizz'
>>> 'fizz' * True
'fizz'
>>> 'fizz' * 0
''
>>> 'fizz' * False
''
>>> 'fizz' + ''
'fizz'

So their first observation that 'fizz' + 'buzz' equals 'fizzbuzz' is followed by what is summed up by the table:

String       Sum of strings     Expanded sum
'fizzbuzz'   'fizz' + 'buzz'    'fizz' * True + 'buzz' * True
'fizz'       'fizz' + ''        'fizz' * True + 'buzz' * False
'buzz'       '' + 'buzz'        'fizz' * False + 'buzz' * True
''           '' + ''            'fizz' * False + 'buzz' * False
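
A quick way to have them verify the table is to evaluate the four combinations directly:

```python
# The four boolean combinations from the table above.
for three, five in [(True, True), (True, False), (False, True), (False, False)]:
    print(repr('fizz' * three + 'buzz' * five))
```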

Which makes them write something like that in the end:

def fizzbuzz(x, y):
  return 'fizz' * x + 'buzz' * y

for i in range(1, 101):
  print(fizzbuzz(i % 3 == 0, i % 5 == 0) or i)

How many things one can learn within a day starting from zero using the humble (and sometimes humbling) fizzbuzz.

A handy configuration snippet that I am using with the nginx ingress controller

One of the most common ways to implement Ingress on Kubernetes is the nginx ingress controller. The nginx ingress controller is configured via annotations that modify the default behavior of the controller. That way, for example, using the configuration snippet annotation you can pass the controller nginx directives that would go into a location block on a standalone nginx.

In fact, whenever I am spinning up an nginx ingress I now always add the following annotation:

nginx.ingress.kubernetes.io/configuration-snippet: |
  #deny all;

Whenever I need to block incoming traffic to the served site, for some emergency reason or whatever, I can do it immediately with kubectl edit ingress by simply uncommenting the hash, rather than googling for the specific annotation name at that moment.
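
For reference, this is roughly how it sits in an Ingress manifest (the resource name here is a hypothetical placeholder):

```yaml
metadata:
  name: example-site
  annotations:
    nginx.ingress.kubernetes.io/configuration-snippet: |
      #deny all;
```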

PS: If you want to define a whitelist properly, it is best that you use the nginx.ingress.kubernetes.io/whitelist-source-range annotation instead.