My Notes taken at Berlin Buzzwords


Keynote – Grant Ingersoll – Lucid Imagination about Lucene

  • @gsingers
  • SolrCloud == ZooKeeper + Solr
  • Map/Reduce Ready recommenders available
  • Identify Topics
  • Latent Dirichlet Allocation
  • Frequent Pattern Mining
  • Clustering
  • K-Means, Dirichlet, Canopy, etc
  • Carrot^2 Document and Search Result Clustering
  • Eigencuts (spectral clustering)
  • Location Aware Search results
  • Query Parsing
  • Filtering
  • Boosting
  • Sorting
  • Singular Value Decomposition (SVD)
  • technique for reducing the dimensionality of large matrices while retaining the core features of the larger space
  • Latent Semantic Analysis uses SVD to provide search over the reduced space
  • Named Entity Recognition
  • Finite-State Queries in Lucene

Lucene Forecast – Version, Unicode, Flex, Modules by Simon Willnauer

  • switched to Kino 10

Making Software for Humans: CouchDB

  • Jan Lehnardt
  • a one-size-fits-all solution for scaling out is not really possible at all @janl #berlinbuzzwords

Text and Metadata extraction with Apache Tika

  • switched to Riak Talk




  • best talk so far

Elastic Search

  • distributed
  • completely HTTP based
  • range queries possible
  • json based
  • lucene based
  • filters: faster than queries (cacheable)
  • Near Realtime Search available

Neo4j – Peter Neubauer

  • Nodes
  • Relationships between nodes
  • properties on nodes and relationships
  • Traversal Framework
  • Lucene integrated (indexing done on commit)

Riak Search

  • different query type patterns – keybased
  • Consistent Hashing and Partitions
  • Optimizations to avoid the “Obama problem”: hotspots in the ring, e.g. many documents containing the word “Obama” all going to the same node
  • Bloom Filters & Caching
  • Batching to save query-time & index-time bandwidth
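
To make the consistent hashing and partitions point more concrete, here is a minimal sketch of a hash ring. It is not Riak's actual implementation: the class name `HashRing`, the `partitions_per_node` parameter, and the use of MD5 are assumptions chosen for illustration only.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a position on the ring (MD5 is an arbitrary choice here)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical hash ring: every node claims several points on the ring."""

    def __init__(self, nodes, partitions_per_node=64):
        # Each node is hashed several times so its share of the ring is spread out.
        self._ring = sorted(
            (ring_position(f"{node}#{i}"), node)
            for node in nodes
            for i in range(partitions_per_node)
        )
        self._positions = [pos for pos, _ in self._ring]

    def node_for(self, key: str) -> str:
        """Return the node owning the first ring point clockwise from the key."""
        idx = bisect.bisect_right(self._positions, ring_position(key))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["riak1", "riak2", "riak3"])
owner = ring.node_for("obama")  # every posting for this term lands on one node
```

The last line also shows where the hotspot comes from: a very frequent term always hashes to the same position, so one node receives all of its traffic. Adding a node only moves the keys that the new node takes over; all other keys keep their owner.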



  • ZeroMQ (0MQ)
  • “the fewest possible moving pieces” -> no service discovery.
  • No Broker

Hypertable

  • used by Baidu search engine
  • Dynamo – used by Amazon for Shopping Cart (uses Read-Repair and Consistent Hashing)
  • during periods of failure there can be latency spikes, because a single machine handles each range
  • LSM – Log Structured Merge Tree
  • eliminates random I/O => holds a log tree structure in memory which is written to disk asynchronously (compaction)
  • uses Bloom filters => helps to avoid disk seeks by running every key through a Bloom filter to determine which files may contain the k/v
  • Dynamic Memory Adjustment based on Workload
  • Hypertable vs. HBase
  • 70% faster than HBase (seq. write / seq. read)
  • one of the reasons: Dynamic Memory Adjustment
  • You can run Map/Reduce (Hadoop) and it is Data Locality Aware
  • Query types: timestamp ranges, versions of the cell possible, but mainly primary key access
  • No delete operation currently (but each column family has a TTL)
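
The Bloom filter point above can be sketched in a few lines. This is an illustration only, not Hypertable's code: `BloomFilter`, `files_to_search`, the filter size, and the SHA-256 based hashing are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Hypothetical Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: str):
        # Derive several bit positions per key by salting the hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely not in this file", so the disk seek is skipped.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def files_to_search(key, filters_by_file):
    """Return only the files whose filter says the key may be present."""
    return [name for name, bf in filters_by_file.items() if bf.might_contain(key)]


file_a, file_b = BloomFilter(), BloomFilter()
file_a.add("row-17")  # file_b stays empty
candidates = files_to_search("row-17", {"file_a": file_a, "file_b": file_b})
```

A lookup then only touches the candidate files. An occasional false positive costs one unnecessary seek, but a key that is present is never missed.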

HDFS Deep Dive

  • one name node, multiple data nodes
  • use 0.20.2-append branch
  • NameNode SPOF in theory, but not in practice

Cassandra Talk – Eric Evans

  • Vector clocks

Massively Parallel Analytics beyond Map/Reduce

  • Context: Stratosphere Project

Collaborative Filtering Mahout – Frank Scholten

  • Mahout is a machine learning library for Java
  • runs on Hadoop
  • Similarity Algorithms
  • TanimotoCoefficientSimilarity
  • LogLikelihoodSimilarity
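
As a reminder of what the Tanimoto coefficient measures, here is a small sketch in plain Python rather than Mahout's Java API. The function name `tanimoto_similarity` and the sample item sets are made up for illustration.

```python
def tanimoto_similarity(items_a, items_b):
    """Size of the intersection divided by size of the union of two item sets."""
    a, b = set(items_a), set(items_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Two users who share 2 of 4 distinct items in total.
alice = {"lucene", "solr", "mahout"}
bob = {"solr", "mahout", "hadoop"}
score = tanimoto_similarity(alice, bob)  # 2 / 4 = 0.5
```

Only the presence of an item matters here, not a rating value, which makes this measure a natural fit for data without explicit preference values.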

Sqoop: database import/export for Hadoop

  • I wish every speaker had structured their talk as logically as Aaron Kimball did with his Hadoop talks


  • parser/optimizer/compiler that translates HiveQL into MapReduce code
  • metastore that stores “schema” information (e.g. table name, column names, data types)
  • schema on read, not write
  • definitely check this out!!!
  • WHERE => map
  • GROUP BY/ORDER BY => reduce
  • JOIN => map or reduce depending on optimizer
  • there is a HiveQL EXPLAIN command
  • learn Python
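
The WHERE => map and GROUP BY => reduce mapping above can be mimicked in plain Python. This is only a sketch of the idea, not what Hive actually generates. The query, the `visits` rows, and the function names are hypothetical; the query being imitated is `SELECT country, COUNT(*) FROM visits WHERE status = 200 GROUP BY country`.

```python
from collections import defaultdict


def map_phase(rows):
    """WHERE runs in the mapper: drop non-matching rows, emit (group key, 1)."""
    for row in rows:
        if row["status"] == 200:
            yield row["country"], 1


def shuffle(pairs):
    """Group the mapper output by key, as the framework does between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """GROUP BY runs in the reducer: aggregate all values that share a key."""
    return {key: sum(values) for key, values in sorted(groups.items())}


visits = [
    {"country": "DE", "status": 200},
    {"country": "US", "status": 200},
    {"country": "DE", "status": 404},
    {"country": "DE", "status": 200},
]
counts = reduce_phase(shuffle(map_phase(visits)))
```

The filter needs no knowledge of other rows, so it can run on each mapper independently. The count needs all rows of one group in one place, which is why it happens after the shuffle.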