Merge remote-tracking branch 'haufe/master'

This commit is contained in:
DonMartin76 2016-01-20 11:18:51 +01:00
commit a29d08137c
40 changed files with 814 additions and 16 deletions

View file

@ -1,8 +1,12 @@
## Contributing
Pull Requests from our design, development and operations teams of [Haufe.Group](http://www.haufe-lexware.com) and its subsidiaries are greatly appreciated. If you've never contributed to an open source project before we are more than happy to walk you through how to create a pull request.
Pull Requests from our design, development and operations teams of [Haufe.Group](http://www.haufe-lexware.com) and its subsidiaries are greatly appreciated. If you've never contributed to an open source project before we are more than happy to walk you through how to create a pull request.
If you want to find out more about or want to improve our blog the following links might be a good starting point
The short version of it is to simply clone this repo into (a) a repo of your own account under (b) the name `[your_account_name].github.io`. Upon your first commit the blog should be published at `http://[your_account_name].github.io`. In this way you can add your changes and test them live. When you are finished create a pull request with your changes and submit it against our blog repo.
Support for Categories and Tags were inspired by [this blog entry](http://www.minddust.com/post/tags-and-categories-on-github-pages/). A list of the defined categories and tags can be found at `_data/categories.yml` and `_data\tags.yml` respectively. If you want to add new categories or tags, you need to add them to the corresponding `.yml` file and add the matching template into the `meta/category` or `meta/tag` directories. Please do not go overboard with adding new categories and tags but try to stay within the ones we have. On the other hand - if you feel strongly about adding one, feel free to submit a pull request.
If you want to find out more about using `github-pages` for blogging or want to improve our blog the following links might be good starting points
* [Jekyll documentation, i.e. how to include images](http://jekyllrb.com/docs/posts/)
* [Github pages powered by Jekyll](https://github.com/jekyll/jekyll/wiki/sites)

View file

@ -58,6 +58,9 @@ markdown: redcarpet
highlighter: pygments
permalink: /:title/
# do not publish future posts
future: false
# The release of Jekyll Now that you're using
version: v1.1.0

14
_data/categories.yml Normal file
View file

@ -0,0 +1,14 @@
- slug: general
name: General
- slug: conference
name: Conference
- slug: howto
name: Howto
- slug: api
name: API
- slug: product
name: Product

38
_data/tags.yml Normal file
View file

@ -0,0 +1,38 @@
- slug: security
name: Security
- slug: cloud
name: Cloud
- slug: cto
name: CTO
- slug: culture
name: Culture
- slug: open-source
name: Open Source
- slug: api
name: API
- slug: microservice
name: Microservice
- slug: devops
name: Devops
- slug: docker
name: Docker
- slug: mobile
name: Mobile
- slug: custdev
name: Customer Development
- slug: smartsteuer
name: Smartsteuer
- slug: logging
name: Logging

View file

@ -2,13 +2,47 @@
layout: default
---
{% assign post = page %}
<article class="post">
<h1>{{page.title}}</h1>
<p class='info'>
{% if post.category %}
{% for site_category in site.data.categories %}
{% if site_category.slug == post.category %}
{% assign category = site_category %}
{% endif %}
{% endfor %}
{% if category %}
{% capture category_content %} in <a href="/meta/category/{{ category.slug }}/">{{ category.name }}</a>{% endcapture %}
{% endif %}
{% else %}
{% assign category_content = '' %}
{% endif %}
{% if post.tags.size > 0 %}
{% capture tags_content %} with tags {% endcapture %}
{% for post_tag in post.tags %}
{% for data_tag in site.data.tags %}
{% if data_tag.slug == post_tag %}
{% assign tag = data_tag %}
{% endif %}
{% endfor %}
{% if tag %}
{% capture tags_content_temp %}{{ tags_content }}<a href="/meta/tag/{{ tag.slug }}/">{{ tag.name }}</a>{% if forloop.last == false %}, {% endif %}{% endcapture %}
{% assign tags_content = tags_content_temp %}
{% endif %}
{% endfor %}
{% else %}
{% assign tags_content = '' %}
{% endif %}
<span>{{page.date | date: "%B %d, %Y" }}</span>
/
<span>by {{page.author}}</span>
<span>by {{page.author}}</span>
<br/>
<span>Posted {{ category_content }}{{ tags_content }}</span>
</p>
<hr/>

View file

@ -0,0 +1,44 @@
---
layout: default
---
<article class="page">
<h1>{{page.title}}</h1>
<p class='info'>
<span>Articles by category</span>
</p>
<hr/>
<div id="post-content">
{% if site.categories[page.category] %}
{% for post in site.categories[page.category] %}
{% capture post_year %}{{ post.date | date: '%Y' }}{% endcapture %}
{% if forloop.first %}
<h3>{{ post_year }}</h3><ul>
{% endif %}
{% if forloop.first == false %}
{% assign previous_index = forloop.index0 | minus: 1 %}
{% capture previous_post_year %}{{ site.categories[page.category][previous_index].date | date: '%Y' }}{% endcapture %}
{% if post_year != previous_post_year %}
</div><h3>{{ post_year }}</h3><ul>
{% endif %}
{% endif %}
<li>
<a href="{{ post.url }}/" class="read-more">
{{ post.title }}
</a>
</li>
{% if forloop.last %}
</ul>
{% endif %}
{% endfor %}
{% else %}
<p>There are no posts in this category.</p>
{% endif %}
</div>
</article>

44
_layouts/post_by_tag.html Normal file
View file

@ -0,0 +1,44 @@
---
layout: default
---
<article class="page">
<h1>{{page.title}}</h1>
<p class='info'>
<span>Articles by tag</span>
</p>
<hr/>
<div id="post-content">
{% if site.tags[page.tag] %}
{% for post in site.tags[page.tag] %}
{% capture post_year %}{{ post.date | date: '%Y' }}{% endcapture %}
{% if forloop.first %}
<h3>{{ post_year }}</h3><ul>
{% endif %}
{% if forloop.first == false %}
{% assign previous_index = forloop.index0 | minus: 1 %}
{% capture previous_post_year %}{{ site.tags[page.tag][previous_index].date | date: '%Y' }}{% endcapture %}
{% if post_year != previous_post_year %}
</div><h3>{{ post_year }}</h3><div class="list-group">
{% endif %}
{% endif %}
<li>
<a href="{{ post.url }}/" class="read-more">
{{ post.title }}
</a>
</li>
{% if forloop.last %}
</ul>
{% endif %}
{% endfor %}
{% else %}
<p>There are no posts for this tag.</p>
{% endif %}
</div>
</article>

View file

@ -4,6 +4,7 @@ title: We are live or How to start a developer blog
subtite: A short introduction
description: How to start our developer blog
category: general
tags: [cto, culture]
author: Holger
author_email: holger.reinhardt@haufe-lexware.com
---

View file

@ -3,7 +3,8 @@ layout: post
title: Notes from OSCON Europe 2015
subtite:
description: Notes from OSCON Europe 2015
category: general
category: conference
tags: [open-source]
author: Marco
author_email: marco.seifried@haufe-lexware.com
---

View file

@ -4,6 +4,7 @@ title: The beginnings of our API Journey
subtite: Intro to our API style guide
description: Got API?
category: api
tags: [api]
author: Holger
author_email: holger.reinhardt@haufe-lexware.com
---

View file

@ -1,9 +1,10 @@
---
layout: post
title: Notes from DevOpsCon 2015
title: Notes from DevOpsCon 2015 - Part 1
subtite:
description: Notes from DevOpsCon 2015
category: general
category: conference
tags: [docker, devops]
author: Rainer
author_email: rainer.zehnle@haufe-lexware.com
---

View file

@ -1,9 +1,10 @@
---
layout: post
title: DockerCon EU 2015
title: Impressions from DockerCon 2015 - Part 1
subtite: Insights, Outlooks and Inbetweens
description: Notes from DevOpsCon 2015
category: general
category: conference
tags: [docker, security]
author: ThomasSc
author_email: thomas.schuering@haufe-lexware.com
---

View file

@ -3,7 +3,8 @@ layout: post
title: APIdays Paris - From Philosophy to Technology and back again
subtite: A biased report from APIdays global in Paris
description: APIdays Paris - From Philosophy to Technology and back again
category: general
category: conference
tags: [api]
author: Martin
author_email: martin.danielsson@haufe-lexware.com
---

View file

@ -1,9 +1,10 @@
---
layout: post
title: Impressions from DockerCon 2015
title: Impressions from DockerCon 2015 - Part 2
subtitle: Barcelona 16. + 17.11.2015
description: my highlights and picks from DockerCon 2015
category: docker
category: conference
tags: [docker]
author: Peter Frey
author_email: peter.frey@haufe-lexware.com
---

View file

@ -1,9 +1,10 @@
---
layout: post
title: How-to - Leverage Let's Encrypt on an Azure Ubuntu VM
title: Howto - Leverage 'Let's Encrypt' on Azure
subtitle: Create free valid SSL certificates in 20 minutes.
description: How-to - Leverage Let's Encrypt on an Azure Ubuntu VM
category: how-to
description: Howto - Leverage 'Let's Encrypt' on an Azure Ubuntu VM
category: howto
tags: [security, cloud]
author: Martin
author_email: martin.danielsson@haufe-lexware.com
---

View file

@ -0,0 +1,77 @@
---
layout: post
title: Creating the Smartsteuer 'Snap' App
subtitle: A backstage report
description: A behind the scenes view of the birth of our youngest creation.
category: product
tags: [smartsteuer, mobile, custdev]
author: Eike Hirsch
author_email: eike.hirsch@smartsteuer.de
---
As we at [smartsteuer](https://www.smartsteuer.de) really enjoyed how our [Smartsteuer Smartphone App](https://www.smartsteuer.de/online/steuererklaerung-online/#smartphone) was imagined and eventually created, I thought it might be fun to write about it. This blog post is not that much technical but describes our journey to a product which (hopefully) will create value for our customers.
### Background
At smartsteuer we create tools for people who want to do their tax filings online. For that, we continuously seek for smart solutions to make this task as easy as possible. One example is an app that we created to answer one question which always nags our customers:
> Why should I care? Can I even expect a refund?
To answer this seriously you have to do a whole lot of calculations for which you need quite some information from the user. Which in turn would create a process which is _not_ fast and easy. So, some years ago we created this app which would do two things:
1. Only asks for about five things every user knows of the top of their heads.
2. Make some educated guesses to answer all the other questions with rough estimates.
The result couldn't be exact but it was good enough to answer said question. It worked quite well even though you still had to provide those five figures.
### Theory
Now, with the help and cooperation of our fellow colleagues from "Haufe Lohn & Gehalt" we wanted to take the app to the next level. It was our aim to reduce the number of questions the user needs to answer and at the same time increase the accuracy of the calculation. I will spare you the details but the result of our efforts was a QR-Code which every user of "Haufe Lohn & Gehalt" would get and which would contain all wage and tax information an employee needs to file her taxes.
So the plan was to enhance the app with a qr-code scanner to safe the user some typing.
We created a quick briefing for our mobile dev agency - they returned an offer - we signed it - the deal was sealed.
You might wonder why I am writing the blog post in the first place, as this sounds all to familiar and is in any regards special. Well you are right. Up until here this story is _only_ an example of solid work. But please bare with me and read on.
### Reality kicks in
About an hour before the agency would come by to kick the project off I was holding an internal meeting to get everybody on the same page. During this meeting it came to light that the project somehow managed to stay under the radar and that everyone in the room did not know about it. This is quite uncommon in our company as everybody is eager to know what is going on and to contribute her ideas and we encourage everyone to do so. But in this particular case this somehow did not happen until said meeting.
And so it was this meeting when all the experts where questioning the new feature and its purpose:
> Why are we doing this?
> What is the benefit for the user?
> Is the benefit big enough to justify the work?
> What data is included in the qr-code?
> Is this really the best we can do for our customers?
> …
It turned out that, while we would get a lot more data to replace some of our guesses with real values, the user would still need to answer four out of the former five questions and instead would need to turn on the scanner and snap the code. That was not the benefit we hoped to deliver.
### Adaption
Luckily we did not stop there. When you happen to have a bunch of smart people in the room, new ideas come up and so a totally new app slowly came into shape.
**What can we do with that qr-code?** It contained lots of data which the user would need to manually enter into her tax filing - a tedious and error-prone process.
**But our main product - the tax filing software - runs in the browser on desktop-PCs.** You don't normally scan qr-codes with an desktop-PC.
**What if we could transform the qr-code-scanner into an input device for our software?** We would need to find a way to link the app with the software without needing the user to do some fancy stuff or even worse needing to understand the whole process. And at the same time keeping her data safe and protected.
**Can't we create a second qr-code which contains the data needed for the linking?**
**And why not use OCR to read any other document**
By the time the agency arrived we had totally rewritten the plan. And they had no idea…
### Outcome
Well, we had to start the meeting with a lot of apologies. The app we original signed up for was from the table but we still wanted *an* app. Luckily our agency was flexible enough to adapt to the new plan and within only one week we had a working prototype. From that day on everything worked according to the plan and now our Smartsteuer App is available at the [IOS App Store](https://itunes.apple.com/de/app/smartsteuer/id1068423226?mt=8) and will be very soon be in the Android Play store as well (We will post the link as soon as its up). Check it out if you like and let me now what you think.
Finally I'd like to give a big shout out to our colleagues at [Haufe-Lexware](http://haufe-lexware.com) and to our agency [Wissenswerft](http://wissenswerft.net) for the great teamwork and flexibility!

View file

@ -1,9 +1,10 @@
---
layout: post
title: DevOpsCon Munich - Is it really about the tools?
title: DevOpsCon 2015 - Is it really about the tools?
subtite: A opinionated commentary by Elias Weingaertner
description: My personal findings from DevOpsCon 2015 in Munich
category: general
category: conference
tags: [devops, microservice]
author: Elias Weingaertner
author_email: elias.weingaertner@haufe-lexware.com
---

View file

@ -0,0 +1,204 @@
---
layout: post
title: Howto - Log Aggregation with Fluentd, Elasticsearch and Kibana
subtite: A short introduction
description: Introduction to log aggregation using Fluentd, Elasticsearch and Kibana
category: howto
tags: [devops, docker, logging]
author: Doru Mihai
author_email: doru.mihai@haufe-lexware.com
---
With an increasing number of systems decoupled and scattered throughout the landscape it becomes increasingly difficult to track and trace events across all systems.
Log aggregation solutions provides a series of benefits to distributed systems.
The problems it tackles are:
- Centralized, aggregated view over all log events
- Normalization of log schema
- Automated processing of log messages
- Support for a great number of event sources and outputs
One of the most prolific open source solutions on the market is the [ELK stack](https://www.elastic.co/videos/introduction-to-the-elk-stack) created by Elastic.
<center>
<img src="/images/logaggregation-elk.png" width=50% align="middle"/>
</center>
ELK stands for Elasticsearch Logstash Kibana and they are respectively their Search engine, Log Shipper and Visualization frontend solutions.
Elasticsearch becomes the nexus for gathering and storing the log data and it is not exclusive to Logstash.
Another very good data collection solution on the market is Fluentd, and it also supports Elasticsearch (amongst others) as the destination for its gathered data. So using the same data repository and frontend solutions, this becomes the EFK stack and if you do a bit of searching you will discover many people have chosen to substitute Elastic's logstash with FluentD and we will talk about why that is in a minute.
<center>
<img src="/images/logaggregation-efk.png" width=20% align="middle"/>
</center>
# Logstash vs FluentD
Both of them are very capable, have [hundreds](https://www.elastic.co/guide/en/logstash/current/input-plugins.html) and [hundreds](http://www.fluentd.org/plugins) of plugins available and are being maintained actively by corporation backed support.
### Technology - Fluentd wins
The big elephant in the room is that Logstash is written in JRuby and FluentD is [written in Ruby with performance sensitive parts in C](http://www.fluentd.org/faqs). As a result the overhead of running a JVM for the log shipper translates in large memory consumption, especially when you compare it to the footprint of Fluentd. The only advantage that Logstash can still invoke is the good parallelism support that the JVM brings and very good [Grok](https://www.elastic.co/guide/en/logstash/current/plugins-filters-grok.html) support.
The only downside for Fluentd was the lack of support for Windows, but even that has been [solved](https://github.com/fluent/fluentd/pull/674) and [grok](https://github.com/kiyoto/fluent-plugin-grok-parser) support is also available for Fluentd and you can even re-use the grok libraries you had used/built, even [Logstash grok patterns](https://github.com/elastic/logstash/tree/v1.4.2/patterns).
### Shippers - Fluentd wins
They both however offer the option of deploying lightweight components that will only read and send the log messages to a fully fledged instance that will do the necessary processing. These are called log forwarders and both have lightweight forwarders written in Go. As of this writing Elastic has released a replacement for it's [logstash-forwarder](https://github.com/elastic/logstash-forwarder)(formerly called Lumberjack) and it is built on top of it's new Data shipper platform [Beats](https://www.elastic.co/products/beats), and it is called [Filebeat](https://github.com/elastic/beats/tree/master/filebeat).
This new Logstash forwarder allows for a TLS secured communication with the log shipper, something that the old one was not capable of but it is still lacking a very valuable feature that fluentd offers, and that is buffering.
### Resiliency - Fluentd wins
As mentioned previously fluentd offers buffering, something that you get "for free" and coupled with the active client-side load balancing you get a very competent solution without a large footprint.
On the other side, logstash doesn't have buffering and only has an in-memory queue of messages that is fixed in length (20 messages) so in case messages can't get through, they are lost. To alleviate this weakness the common practice is to set up an external queue (like [Redis](http://www.logstash.net/docs/1.3.2//tutorials/getting-started-centralized)) for persistence of the messages in case something goes wrong at either end. They are [working on it](https://github.com/elastic/logstash/issues/2605) though, so in the future we might see an improvement in this area.
Fluentd offers in-memory or file based buffering coupled with [active-active and active-standby load balancing and even weighted load balancing](http://docs.fluentd.org/articles/high-availability) and last but not least it also offers [at-most-once and at-least-once](http://docs.fluentd.org/articles/out_forward#requireackresponse) semantics.
# Additional considerations
Logstash benefits from a more chiselled, mature implementation due to the fact that the core and a lot of the essential plugins are maintained by Elastic, and some may argue that it's easier to deploy a JRE and the logstash jar and be done with it while others would consider it overkill to have a JVM running for such a small task. Plus the need to deploy and maintain a separate queueing
Fluentd provides just the core and a couple of input/output plugins and filters and the rest of the large number of plugins available are community driven and so you are exposed to the risk of potential version incompatibilities and lack of documentation and support.
I have personally seen that there is a bit of chaos since each plugin creator will define his own set of configuration input variables and there isn't a sense of consistency when you look at different plugins. You will encounter variables that are optional and have different default values, variables that are not properly documented but you can deduct their usage from the examples that the author offers and virtually all known naming conventions will appear in your config file.
# What next?
Well, as you can probably already tell, I have chosen to go with fluentd, and as such it became quickly apparent that I need to integrate it with Elasticsearch and Kibana to have a complete solution, and that wasn't a smooth ride due to 2 issues:
- Timestamps were sent to Elasticsearch without milliseconds
- All field values were by default analyzed fields
For communicating with Elasticsearch I used the plugin [fluent-plugin-elasticsearch](https://github.com/uken/fluent-plugin-elasticsearch) as presented in one of their very helpful [use case tutorials](http://docs.fluentd.org/articles/free-alternative-to-splunk-by-fluentd).
This plugin allows fluentd to impersonate logstash by just enabling the setting `logstash-format` in the configuration file. I snooped arround a bit and found that basically the only difference is that the plugin will make sure that the message sent has a timestamp field named `@timestamp`.
And here we arrive at our first problem....
### Timestamp fix
This is a pain because if you want to properly visualize a set of log messages gathered from multiple systems, in sequence, to be able to see exactly what step followed the other.....well, you see the problem.
Let's take a look at what fluentd sends to Elasticsearch. Here is a sample log file with 2 log messages:
~~~java
2015-11-12 06:34:01,471 [ ajp-apr-127.0.0.1-8009-exec-3] LogInterceptor INFO ==== Request ===
2015-11-12 06:34:01,473 [ ajp-apr-127.0.0.1-8009-exec-3] LogInterceptor INFO GET /monitor/broker/ HTTP/1.1
~~~
A message sent to Elasticsearch from fluentd would contain these values:
*-this isn't the exact message, this is the result of the stdout output plugin-*
~~~ruby
2015-11-12 06:34:01 -0800 tag.common: {"message":"[ ajp-apr-127.0.0.1-8009-exec-3] LogInterceptor INFO ==== Request ===","time_as_string":"2015-11-12 06:34:01 -0800"}
2015-11-12 06:34:01 -0800 tag.common: {"message":"[ ajp-apr-127.0.0.1-8009-exec-3] LogInterceptor INFO GET /monitor/broker/ HTTP/1.1\n","time_as_string":"2015-11-12 06:34:01 -0800"}
~~~
I added the `time_as_string` field in there just so you can see the literal string that is sent as the time value.
This is a known issue and initially it was the fault of fluentd for not supporting that level of granularity, but is had been [fixed](https://github.com/fluent/fluentd/issues/461). Sadly, the fix has not made it's way to the Elasticsearch plugin and so, [alternatives have appeared](https://github.com/shivaken/fluent-plugin-better-timestamp).
The fix basically involves manually formatting the `@timestamp` field to have the format `YYYY-MM-ddThh:mm:ss.SSSZ`. So you can either bring on the previously mentioned `fluent-plugin-better-timestamp` into your log processing pipeline to act as a filter that fixes your timestamps OR you can build it yourself.
In order to build it yourself you only need the `record_transformer` filter that is part of the core of plugins that fluentd comes with and that I anyway would recommend you use for enriching your messages with things like the source hostname for example.
Next you need to parse the timestamp of your logs into separate date, time and millisecond components (which is basically what the better-timestamp plugin asks you to do, to some extent), and then to create a filter that would match all the messages you will send to Elasticsearch and to create the `@timestamp` value by appending the 3 components. This makes use of the fact that fluentd also allows you to run ruby code within your record_transformer filters to accommodate for more special log manipulation tasks.
~~~xml
<filter tag.**>
type record_transformer
enable_ruby true
<record>
@timestamp ${date_string + "T" + time_string + "." + msec + "Z"}
</record>
</filter>
~~~
The result is that the above sample will come out like this:
~~~ruby
2015-12-12 05:26:15 -0800 akai.common: {"date_string":"2015-11-12","time_string":"06:34:01","msec":"471","message":"[ ajp-apr-127.0.0.1-8009-exec-3] LogInterceptor INFO ==== Request ===","@timestamp":"2015-11-12T06:34:01.471Z"}
2015-12-12 05:26:15 -0800 akai.common: {"date_string":"2015-11-12","time_string":"06:34:01","msec":"473","message":"[ ajp-apr-127.0.0.1-8009-exec-3] LogInterceptor INFO GET /monitor/broker/ HTTP/1.1\n","@timestamp":"2015-11-12T06:34:01.473Z"}
~~~
*__Note__: you can use the same record_transformer filter to remove the 3 separate time components after creating the `@timestamp` field via the `remove_keys` option.*
### Do not analyse
There are 2 reasons why you shouldn't want your fields to be analyzed in this scenario:
- It will potentially increase the storage requirements
- It will make it impossible to do proper analysis and visualization on your data if you have field values that contain hyphens, dots or others.
Ok, so first, why does it increase the storage requirements?
Well, while researching what would be the proper hardware sizing requirements for setting up our production EFK installation I stumbled upon [this](http://peter.mistermoo.com/2015/01/05/hardware-sizing-or-how-many-servers-do-i-really-need/) post that goes in detail about what and why and how big can the problem become.
Worst case scenario, you could be using up to **40% more** disk space than you really need. Pretty bad huh?
And the second issue which would become apparent much quicker than the first is that when you will try to make use of Kibana to visualize your data you will encounter the issue that fields that contain hyphens for example will appear split and duplicate when used in visualizations.
For instance, by using the record_transformer I would send the hostname and also a statically specified field called `sourceProject`, to be able to group together messages that came from different identical instances of a project application.
Using this example configuration I tried to create a pie chart showing the number of messages per project for a dashboard. Here is what I got.
~~~ruby
<filter tag.**>
type record_transformer
enable_ruby true
<record>
@timestamp ${date_string + "T" + time_string + "." + msec + "Z"}
sourceProject Test-Analyzed-Field
</record>
</filter>
~~~
Sample output from stdout:
~~~ruby
2015-12-12 06:01:35 -0800 clear: {"date_string":"2015-10-15","time_string":"06:37:32","msec":"415","message":"[amelJettyClient(0xdc64419)-706] jetty:test/test INFO totallyAnonymousContent: http://whyAreYouReadingThis?:)/history/3374425?limit=1","@timestamp":"2015-10-15T06:37:32.415Z","sourceProject":"Test-Analyzed-Field"}
~~~
And here is the result of trying to use it in a visualization:
<center>
<img src="/images/logaggregation-analyzed-field.png" width=35% align="middle"/>
</center>
I should mention, what you are seeing is the result of 6 messages that all have the field sourceProject set to the value "Test-Analyzed-Field".
Sadly, once you put some data into Elasticsearch, indices are automatically created (by the fluent-plugin-elasticsearch) and mappings along with them and once a field is mapped as being analyzed [it cannot be changed](https://www.elastic.co/blog/changing-mapping-with-zero-downtime).
Curiously this did not happen when using Logstash, which made me look into how they are handling this problem. Then I discovered the issue was discussed also in the context of the fluent-plugin-elasticsearch and [the solution was posted there](https://github.com/uken/fluent-plugin-elasticsearch/issues/33) along with the request to include it in future versions of the plugin.
And the solution is: When Elasticsearch creates a new index, it will rely on the existence of a template to create that index. Logstash comes with a template of its own that it uses to tell Elasticsearch to create not analyzed copies of the fields it sends to it so that users can benefit from the analyzed fields for searching and the not analyzed fields when doing visualizations. And that template can be found [here](https://github.com/logstash-plugins/logstash-output-elasticsearch/blob/master/lib/logstash/outputs/elasticsearch/elasticsearch-template.json).
And what you basically need to do is to do a curl put with that json content to ES and then all the indices created that are prefixed with `logstash-*` will use that template. Be aware that with the fluent-plugin-elasticsearch you can specify your own index prefix so make sure to adjust the template to match your prefix:
~~~bash
curl -XPUT localhost:9200/_template/template_doru -d '{
"template" : "logstash-*",
"settings" : {....
}'
~~~
The main thing to note in the whole template is this section:
~~~json
"string_fields" : {
"match" : "*",
"match_mapping_type" : "string",
"mapping" : {
"type" : "string", "index" : "analyzed", "omit_norms" : true,
"fielddata" : { "format" : "disabled" },
"fields" : {
"raw" : {"type": "string", "index" : "not_analyzed", "doc_values" : true, "ignore_above" : 256}
}
}
}
~~~
This tells Elasticsearch that for any field of type string that it receives it should create a mapping of type string that is analyzed + another field that adds a `.raw` suffix that will not be analyzed.
The `not_analyzed` suffixed field is the one you can safely use in visualizations, but do keep in mind that this creates the scenario mentioned before where you can have up to 40% inflation in storage requirements because you will have both analyzed and not_analyzed fields in store.
# Have fun
So, now you know what we went through here at [HaufeDev](http://haufe-lexware.github.io/) and what problems we faced and how we can overcome them.
If you want to give it a try you can take a look at [our docker templates on github](https://github.com/Haufe-Lexware/docker-templates), there you will find a [logaggregation template](https://github.com/Haufe-Lexware/docker-templates/tree/master/logaggregation) for an EFK setup + a shipper that can transfer messages securely to the EFK solution and you can have it up and running in a matter of minutes.

View file

@ -0,0 +1,219 @@
---
layout: post
title: Howto - Configuring Log Parsing with Fluentd
subtite: Approaches to log parsing
description: Description of a couple of approaches to designing your fluentd configuration.
category: howto
tags: [devops, logging]
author: Doru Mihai
author_email: doru.mihai@haufe-lexware.com
---
When you will start to deploy your log shippers to more and more systems you will encounter the issue of adapting your solution to be able to parse whatever log format and source each system is using. Luckily, fluentd has a lot of plugins and you can approach a problem of parsing a log file in different ways.
The main reason you may want to parse a log file and not just pass along the contents is that when you have multi-line log messages that you would want to transfer as a single element rather than split up in an incoherent sequence.
Another reason would be log files that contain multiple log formats that you would want to parse into a common data structure for easy processing.
Below I will enumerate a couple of strategies that can be applied for parsing logs.
And last but not least, there is the case that you have multiple log sources (perhaps each using a different technology) and you want to parse them and aggregate all information to a common data structure for coherent analysis and visualization of the data.
## One Regex to rule them all
The simplest approach is to just parse all messages using the common denominator. This will lead to a very black-box type approach to your messages deferring any parsing efforts to a later time or to another component further downstream.
In the case of a typical log file a configuration can be something like this (but not necessarily):
~~~xml
<source>
type tail
path /var/log/test.log
read_from_head true
tag test.unprocessed
format multiline
format_firstline /\d{4}-\d{1,2}-\d{1,2} \d{1,2}:\d{1,2}:\d{1,2},\d{3}/
#we go with the most generic pattern where we know a message will have
#a timestamp in front of it, the rest is just stored in the field 'message'
format1 /(?<time>\d{4}-\d{1,2}-\d{1,2} \d{1,2}:\d{1,2}:\d{1,2},\d{3}) (?<message>(.|\s)*)/
</source>
~~~
You will notice we still do a bit of parsing, the minimal level would be to just have a multiline format to split the log contents into separate messages and then to push the contents on.
The reason we do not just put everything into a single field with a greedy regex pattern is to have the correct timestamp pushed showing the time of the log and not the time when the log message was read by the log shipper, along with the rest of the message.
If more pieces are common to all messages, it can be included in the regex for separate extraction, if it is of interest of course.
## Divide & Conquer
As the name would suggest, this approach suggests that you should try to create an internal routing that would allow you to precisely target log messages based on their content later on downstream.
An example of this is shown in the configuration below:
~~~ruby
#Sample input:
#2015-10-15 08:19:05,190 [testThread] INFO testClass - Queue: update.testEntity; method: updateTestEntity; Object: testEntity; Key: 154696614; MessageID: ID:test1-37782-1444827636952-1:1:2:25:1; CorrelationID: f583ed1c-5352-4916-8252-47298732516e; started processing
#2015-10-15 06:44:01,727 [ ajp-apr-127.0.0.1-8009-exec-2] LogInterceptor INFO user-agent: check_http/v2.1.1 (monitoring-plugins 2.1.1)
#connection: close
#host: test.testing.com
#content-length: 0
#X-Forwarded-For: 8.8.8.8
#2015-10-15 08:21:04,716 [ ttt-grp-127.0.0.1-8119-test-11] LogInterceptor INFO HTTP/1.1 200 OK
<source>
type tail
path /test/test.log
tag log.unprocessed
read_from_head true
format multiline
format_firstline /\d{4}-\d{1,2}-\d{1,2} \d{1,2}:\d{1,2}:\d{1,2},\d{3}/
format1 /(?<time>\d{4}-\d{1,2}-\d{1,2} \d{1,2}:\d{1,2}:\d{1,2},\d{3}) (?<message>(.|\s)*)/
</source>
<match log.unprocessed.**>
type rewrite_tag_filter
rewriterule1 message \bCorrelationID\b correlation
rewriterule2 message .* clear
</match>
<match clear>
type null
</match>
<filter correlation>
type parser
key_name message
format / * (.*method:) (?<method>[^;]*) *(.*Object:) (?<object>[^;]*) *(.*Key:) (?<objectkey>[^;]*) *(.*MessageID:) (?<messageID>[^;]*) *(.*CorrelationID:) (?<correlationID>[^;]*).*/
reserve_data yes
</filter>
<match correlation>
type stdout
</match>
~~~
This approach is useful when we have multiline log messages within our logfile and the messages themselves have different formats for the content. Still, the important thing to note is that all log messages are prefixed by a standard timestamp, this is key to succesfully splitting messages correctly.
The break-down of the approach with the configuration shown is that all entries in the log are first parsed into individual events to be processed. The key separator here is the timestamp and it is marked by the *format_firstline* key/value pair as a regex pattern.
Fluentd will continue to read logfile lines and keep them in a buffer until a line is reached that starts with text that matches the regex pattern specified in the *format_firstline* field. After detecting a new log message, the one already in the buffer is packaged and sent to the parser defined by the regex pattern stored in the format<n> fields.
Looking at the example, all our log messages (single or multiline) will take the form:
~~~json
{ "time":"2015-10-15 08:21:04,716", "message":"[ ttt-grp-127.0.0.1-8119-test-11] LogInterceptor INFO HTTP/1.1 200 OK" }
~~~
Being tagged with log.unprocessed all the messages will be caught by the *rewrite_tag_filter* match tag and it is at this point that we can pinpoint what type of contents each message has and we can re-tag them for individual processing.
This module is key to the whole mechanism as the *rewrite_tag_filter* takes the role of a router. You can use this module to redirect messages to different processing modules or even outputs depending on the rules you define in it.
## Shooting fish in a barrel
You can use *fluent-plugin-multi-format-parser* to try to match each line read from the log file with a specific regex pattern (format).
This approach probably comes with performance drawbacks because fluentd will try to match using each regex pattern sequentially until one matches.
An example of this approach can be seen below:
~~~ruby
<source>
type tail
path /var/log/aka/test.log
read_from_head true
keep_time_key true
tag akai.unprocessed
format multi_format
# 2015-10-15 08:19:05,190 [externalPerson]] INFO externalPersonToSapSystem - Queue: aka.update.externalPerson; method: ; Object: externalPerson; Key: ; MessageID: ID:test1-37782-1444827636952-1:1:2:25:1; CorrelationID: f583ed1c-5352-4916-8252-47698732506e; received
<pattern>
format /(?<time>\d{4}-\d{1,2}-\d{1,2} \d{1,2}:\d{1,2}:\d{1,2},\d{3}) \[(?<thread>.*)\] (?<loglevel>[A-Z]*) * (.*method:) (?<method>[^;]*) *(.*Object:) (?<object>[^;]*) *(.*Key:) (?<objectkey>[^;]*) *(.*MessageID:) (?<messageID>[^;]*) *(.*CorrelationID:) (?<correlationID>[^;]*); (?<status>.*)/
</pattern>
# 2015-10-13 12:30:18,475 [ajp-apr-127.0.0.1-8009-exec-14] LogInterceptor INFO Content-Type: text/xml; charset=UTF-8
# Authorization: Basic UFJPRE9NT1xyZXN0VGVzdFVzZXI6e3tjc2Vydi5wYXNzd29yZH19
# breadcrumbId: ID-haufe-prodomo-stage-51837-1444690926044-1-1731
# checkoutId: 0
# Content-Encoding: gzip
# CS-Cache-Minutes: 0
# CS-Cache-Time: 2015-10-13 12:30:13
# CS-Client-IP: 172.16.2.51
# CS-Inner-Duration: 207.6 ms
# CS-Outer-Duration: 413.1 ms
# CS-Project: PRODOMO
# CS-UserID: 190844
# CS-UserName: restTestUser
# Expires: Thu, 19 Nov 1981 08:52:00 GMT
# Server: Apache
# SSL_CLIENT_S_DN_OU: Aka-Testuser
# User-Agent: check_http/v2.1.1 (monitoring-plugins 2.1.1)
# Vary: Accept-Encoding
# workplace: 0
# X-Forwarded-For: 213.164.69.219
# X-Powered-By: PHP/5.3.21 ZendServer/5.0
# Content-Length: 2883
# Connection: close
<pattern>
format multiline
format_firstline /\d{4}-\d{1,2}-\d{1,2} \d{1,2}:\d{1,2}:\d{1,2},\d{3}/
format1 /(?<time>\d{4}-\d{1,2}-\d{1,2} \d{1,2}:\d{1,2}:\d{1,2},\d{3}) \[(?<thread>.*)\] (?<class>\w*) * (?<loglevel>[A-Z]*) (?<message>.*)/
</pattern>
#Greedy default format
<pattern>
format /(?<time>\d{4}-\d{1,2}-\d{1,2} \d{1,2}:\d{1,2}:\d{1,2},\d{3}) (?<message>(.|\s)*)/
</pattern>
</source>
~~~
When choosing this path there are multiple issues you need to be aware of:
* The pattern matching is done sequentially and the first pattern that matches the message is used to parse it and the message is passed along
* You need to make sure the most specific patterns are higher in the list and the more generic ones lower
* Make sure to create a very generic pattern to use as a default at the end of the list of patterns.
* Performance will probably decrease due to the trial&error approach to finding the matching regex
The biggest issue with this approach is that it is very very hard to handle multi-line log messages if there are significantly different log syntaxes in the log.
__Warning:__ Be aware that the multiline parser continues to store log messages in a buffer until it matches another firstline token and when it does it then it packages and emits the multiline log it collected.
This approach is useful when you have good control and know-how about the format of your log source.
## Order & Chaos
Introducing Grok!
Slowly but surely getting all your different syntaxes, for which you will have to define different regular expressions, will make your config file look very messy, filled with regex-es that are longer and longer, and just relying on the multiple format lines to split it up doesn't bring that much readability nor does it help with maintainability. Reusability is something that we cannot even discuss in the case of pure regex formatters.
Grok allows you to define a library of regexes that can be reused and referenced via identifiers. It is structured as a list of key-value pairs and can also contain named capture groups.
An example of such a library can be seen below. (Note this is just a snippet and does not contain all the minor expressions that are referenced from within the ones enumerated below)
~~~ruby
###
# AKA-I
###
# Queue: aka.update.externalPerson; method: updateExternalPerson; Object: externalPerson; Key: ; MessageID: ID:test1-37782-1444827636952-1:1:2:25:1; CorrelationID: f583ed1c-5352-4916-8252-47698732506e; received
AKA_AKAI_CORRELATIONLOG %{AKA_AKAI_QUEUELABEL} %{AKA_AKAI_QUEUE:queue}; %{AKA_AKAI_METHODLABEL} %{AKA_AKAI_METHOD:method}; %{AKA_AKAI_OBJECTLABEL} %{AKA_AKAI_OBJECT:object}; %{AKA_AKAI_KEYLABEL} %{AKA_AKAI_KEY:key}; %{AKA_AKAI_MESSAGEIDLABEL} %{AKA_AKAI_MESSAGEID:messageId}; %{AKA_AKAI_CORRELATIONLABEL} %{AKA_AKAI_CORRELATIONID:correlationId}; %{WORD:message}
#CorrelationId log message from AKAI
#2015-10-15 08:19:05,190 [externalPerson]] INFO externalPersonToSapSystem - Queue: aka.update.externalPerson; method: updateExternalPerson; Object: externalPerson; Key: ; MessageID: ID:test1-37782-1444827636952-1:1:2:25:1; CorrelationID: f583ed1c-5352-4916-8252-47698732506e; received
AKA_AKAI_CORRELATIONID %{AKAIDATESTAMP:time} %{AKA_THREAD:threadName} %{LOGLEVEL:logLevel} * %{JAVACLASS:className} %{AKA_AKAI_CORRELATIONLOG}
#Multiline generic log pattern
# For detecting that a new log message has been read we will use AKAIDATESTAMP as the pattern and then match with a greedy pattern
AKA_AKAI_GREEDY %{AKAIDATESTAMP:time} %{AKA_THREAD:threadName} %{JAVACLASS:class} * %{LOGLEVEL:loglevel} %{AKA_GREEDYMULTILINE:message}
# Variation since some log messages have loglevel before classname or vice-versa
AKA_AKAI_GREEDY2 %{AKAIDATESTAMP:time} %{AKA_THREAD:threadName} * %{LOGLEVEL:loglevel} %{JAVACLASS:class} %{AKA_GREEDYMULTILINE:message}
###
# ARGO Specific
###
AKA_ARGO_DATESTAMP %{MONTHDAY}-%{MONTH}-%{YEAR} %{TIME}
#17-Nov-2015 07:53:38.786 INFO [www.lexware-akademie.de-startStop-1] org.apache.catalina.startup.HostConfig.deployDirectory Deployment of web application directory /var/www/html/www.lexware-akademie.de/ROOT has finished in 44,796 ms
AKA_ARGO_LOG %{AKA_ARGO_DATESTAMP:time} %{LOGLEVEL:logLevel} %{AKA_THREAD:threadName} %{JAVACLASS:className} %{AKA_GREEDYMULTILINE:message}
#2015-11-17 07:53:51.606 RVW INFO AbstractApplicationContext:515 - Refreshing Root WebApplicationContext: startup date [Tue Nov 17 07:53:51 CET 2015]; root of context hierarchy
AKA_ARGO_LOG2 %{AKAIDATESTAMP2:time} %{WORD:argoComponent} *%{LOGLEVEL:logLevel} * %{AKA_CLASSLINENUMBER:className} %{AKA_GREEDYMULTILINE:message}
#[GC (Allocation Failure) [ParNew: 39296K->4351K(39296K), 0.0070888 secs] 147064K->114083K(172160K), 0.0071458 secs] [Times: user=0.01 sys=0.00, real=0.01 secs]
#[CMS-concurrent-abortable-preclean: 0.088/0.233 secs] [Times: user=0.37 sys=0.02, real=0.23 secs]
AKA_ARGO_SOURCE (GC|CMS)
AKA_ARGO_GC \[%{AKA_ARGO_SOURCE:source} %{AKA_GREEDYMULTILINE:message}
~~~
To use Grok you will need to install the *fluent-plugin-grok-parser* and then you can use grok patterns with any of the other techniques previously described with regex: Multiline, Multi-format.
# Go Get'em!
Now you should have a pretty good idea of how you can approach different log formats and how you can structure your config file using a couple of plugins from the hundreds of plugins available.

Binary file not shown.

After

Width:  |  Height:  |  Size: 24 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 84 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 126 KiB

6
meta/category/api.md Normal file
View file

@ -0,0 +1,6 @@
---
layout: post_by_category
title: API
category: api
permalink: /meta/category/api/
---

View file

@ -0,0 +1,6 @@
---
layout: post_by_category
title: Conference
category: conference
permalink: /meta/category/conference/
---

6
meta/category/general.md Normal file
View file

@ -0,0 +1,6 @@
---
layout: post_by_category
title: General
category: general
permalink: /meta/category/general/
---

6
meta/category/howto.md Normal file
View file

@ -0,0 +1,6 @@
---
layout: post_by_category
title: Howto
category: howto
permalink: /meta/category/howto/
---

6
meta/category/product.md Normal file
View file

@ -0,0 +1,6 @@
---
layout: post_by_category
title: Product
category: product
permalink: /meta/category/product/
---

6
meta/tag/api.md Normal file
View file

@ -0,0 +1,6 @@
---
layout: post_by_tag
title: API
tag: api
permalink: /meta/tag/api/
---

6
meta/tag/cloud.md Normal file
View file

@ -0,0 +1,6 @@
---
layout: post_by_tag
title: Cloud
tag: cloud
permalink: /meta/tag/cloud/
---

6
meta/tag/cto.md Normal file
View file

@ -0,0 +1,6 @@
---
layout: post_by_tag
title: CTO
tag: cto
permalink: /meta/tag/cto/
---

6
meta/tag/culture.md Normal file
View file

@ -0,0 +1,6 @@
---
layout: post_by_tag
title: Culture
tag: culture
permalink: /meta/tag/culture/
---

6
meta/tag/custdev.md Normal file
View file

@ -0,0 +1,6 @@
---
layout: post_by_tag
title: Customer Development
tag: custdev
permalink: /meta/tag/custdev/
---

6
meta/tag/devops.md Normal file
View file

@ -0,0 +1,6 @@
---
layout: post_by_tag
title: Devops
tag: devops
permalink: /meta/tag/devops/
---

6
meta/tag/docker.md Normal file
View file

@ -0,0 +1,6 @@
---
layout: post_by_tag
title: Docker
tag: docker
permalink: /meta/tag/docker/
---

6
meta/tag/logging.md Normal file
View file

@ -0,0 +1,6 @@
---
layout: post_by_tag
title: Logging
tag: logging
permalink: /meta/tag/logging/
---

6
meta/tag/microservice.md Normal file
View file

@ -0,0 +1,6 @@
---
layout: post_by_tag
title: Microservice
tag: microservice
permalink: /meta/tag/microservice/
---

6
meta/tag/mobile.md Normal file
View file

@ -0,0 +1,6 @@
---
layout: post_by_tag
title: Mobile
tag: mobile
permalink: /meta/tag/mobile/
---

6
meta/tag/open-source.md Normal file
View file

@ -0,0 +1,6 @@
---
layout: post_by_tag
title: Open Source
tag: open-source
permalink: /meta/tag/open-source/
---

6
meta/tag/security.md Normal file
View file

@ -0,0 +1,6 @@
---
layout: post_by_tag
title: Security
tag: security
permalink: /meta/tag/security/
---

6
meta/tag/smartsteuer.md Normal file
View file

@ -0,0 +1,6 @@
---
layout: post_by_tag
title: Smartsteuer
tag: smartsteuer
permalink: /meta/tag/smartsteuer/
---