My Notes taken at Berlin Buzzwords

BerlinBuzzwords

http://berlinbuzzwords.de

Keynote – Grant Ingersoll – Lucid Imagination about Lucene

  • @gsingers
  • grant@lucidimagination.com
  • http://www.manning.com/ingersoll
  • SolrCloud == ZooKeeper + Solr
  • http://wiki.apache.org/solr/SolrCloud
  • http://en.wikipedia.org/wiki/Sentiment_analysis
  • Map/Reduce Ready recommenders available
  • Identify Topics
  • Latent Dirichlet Allocation ==> http://de.wikipedia.org/wiki/Latent_Dirichlet_Allocation
  • Frequent Pattern Mining
  • Clustering
  • K-Means, Dirichlet, Canopy, etc
  • Carrot^2 Document and Search Result Clustering => http://project.carrot2.org/
  • Eigen Cuts (spectral clustering) => http://www.google.de/search?hl=de&q=eigen+cuts+clustering&aq=f&aqi=&aql=&oq=&gs_rfai=
  • http://cwiki.apache.org/MAHOUT/algorithms.html
  • Location Aware Search results
  • Query Parsing
  • Filtering
  • Boosting
  • Sorting
  • Singular Value Decomposition (SVD) => http://en.wikipedia.org/wiki/Singular_value_decomposition
  • technique for reducing the dimensionality of large matrices while retaining the core features of the larger space
  • Latent Semantic Analysis uses SVD to provide search over the reduced space => http://github.com/algoriffic/lsa4solr
  • Named Entity Recognition => http://en.wikipedia.org/wiki/Named_entity_recognition
  • Finite-State Queries in Lucene http://lingpipe-blog.com/2010/03/25/finite-state-queries-in-lucene/
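
To make the SVD bullet above a bit more concrete, here is a minimal sketch in plain Python. It estimates the largest singular value and its vectors by power iteration and builds the rank-1 approximation of a small matrix. The function names and the example matrix are made up for illustration and do not come from the talk; a real system would use a linear algebra library, and the sketch assumes the start vector is not orthogonal to the dominant singular vector.

```python
import math

def top_singular_triple(a, iterations=200):
    """Estimate the largest singular value and its vectors via power iteration on A^T A."""
    rows, cols = len(a), len(a[0])
    v = [1.0] * cols  # arbitrary non-zero start vector (assumption of this sketch)
    for _ in range(iterations):
        # u = A v
        u = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        # w = A^T u, which equals (A^T A) v
        w = [sum(a[i][j] * u[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return sigma, u, v

def rank_one_approximation(a):
    """Keep only the strongest 'feature' of the matrix: sigma * u * v^T."""
    sigma, u, v = top_singular_triple(a)
    return [[sigma * ui * vj for vj in v] for ui in u]

# Hypothetical 3x2 matrix whose rows are multiples of each other (rank 1),
# so the rank-1 approximation reproduces it up to rounding.
matrix = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
approximation = rank_one_approximation(matrix)
```

Keeping more than one singular triple works the same way in principle; LSA keeps the top k and searches in that reduced space.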

Lucene Forecast – Version, Unicode, Flex, Modules by Simon Willnauer

  • switched to Kino 10

Making Software for Humans: CouchDB

  • Jan Lehnardt
  • a one-size-fits-all solution for scaling out is not really possible at all @janl #berlinbuzzwords

Text and Metadata extraction with Apache Tika

  • http://tika.apache.org/
  • switched to Riak Talk

METACARTA GEOSEARCH TOOLKIT FOR SOLR

  • http://berlinbuzzwords.de/content/metacarta-geosearch-toolkit-solr

LEARNING LESSONS: BUILDING A CMS ON TOP OF NOSQL TECHNOLOGIES

  • http://berlinbuzzwords.de/content/learning-lessons-building-cms-top-nosql-technologies
  • best talk so far

Elastic Search

  • http://berlinbuzzwords.de/content/elasticsearch-you-know-search
  • distributed
  • completely HTTP based
  • range queries possible
  • json based
  • lucene based
  • filters: faster than queries (cachable)
  • Near Realtime Search available

Neo4j – Peter Neubauer

  • Nodes
  • Relationships between nodes
  • properties on nodes and relationships
  • Traversal Framework
  • Lucene integrated (indexing done on commit)
  • http://www.google.de/search?hl=de&q=rdf+reasoning&aq=f&aqi=&aql=&oq=&gs_rfai=

Riak Search

  • http://berlinbuzzwords.de/content/basho-search
  • different query type patterns – key-based
  • Consistent Hashing and Partitions
  • Optimizations (to avoid the “Obama Problem”: hotspots in the ring, e.g. many documents containing the word Obama all going to the same node)
  • Bloom Filters & Caching
  • Batching to save query-time & index-time bandwidth
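
A rough sketch of the consistent hashing idea from the list above, in plain Python. This is not Riak's implementation: the class name, the MD5 choice and the number of partitions per node are assumptions for illustration. It also shows where the "Obama Problem" comes from: if the ring key is the term itself, every lookup for a popular term resolves to the same node.

```python
import bisect
import hashlib

def ring_position(key):
    """Map a string to a position on the ring (MD5 here; any stable hash would do)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class HashRing:
    """Hypothetical minimal ring: each node claims several partitions."""

    def __init__(self, nodes, partitions_per_node=64):
        self._ring = sorted(
            (ring_position(f"{node}#{i}"), node)
            for node in nodes
            for i in range(partitions_per_node)
        )
        self._positions = [position for position, _ in self._ring]

    def node_for(self, key):
        """Return the node owning the first partition at or after the key's position."""
        index = bisect.bisect_left(self._positions, ring_position(key))
        return self._ring[index % len(self._ring)][1]  # wrap around the ring

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("obama")  # always the same node for the same term
```

The useful property is that removing a node only moves the keys that node owned; keys owned by other nodes stay where they are.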

Day 2

Keynote

  • ZeroMQ – 0MQ – http://www.zeromq.org/
  • http://www.zeromq.org/blog:multithreading-magic
  • “the fewest possible moving pieces” -> no service discovery.
  • No Broker

Hypertable – http://berlinbuzzwords.de/content/hypertable-ultimate-scaling-machine

  • used by Baidu search engine
  • Dynamo – used by Amazon for Shopping Cart (uses Read-Repair and Consistent Hashing)
  • during periods of failure there can be latency spikes, because a single machine handles each range
  • LSM – Log Structured Merge Tree
  • eliminates random I/O => holds a log tree structure in memory which is written to disk asynchronously (compaction)
  • uses Bloom Filters => helps to avoid disk seeks by running every key through a Bloom filter to determine which file the k/v may be in
  • Dynamic Memory Adjustment based on Workload
  • Hypertable vs. HBase
  • 70% faster than HBase (seq. write / seq. read)
  • one of the reasons: Dynamic Memory Adjustment
  • You can run Map/Reduce (Hadoop) and it is Data Locality Aware
  • Query types: timestamp ranges, versions of the cell possible, but mainly primary key access
  • No delete operation currently (but each column family has a TTL)
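
The Bloom filter bullet above can be sketched in a few lines of plain Python. This is only an illustration of the technique, not Hypertable's code: the class, the SHA-256 based hashing and the sizes are assumptions. The point is that a negative answer is definite, so a lookup can skip every file whose filter says "not here" without touching the disk.

```python
import hashlib

class BloomFilter:
    """Hypothetical minimal Bloom filter over string keys."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity over compactness

    def _positions(self, key):
        # Derive several bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.size_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[position] for position in self._positions(key))

def files_to_read(key, filters_by_file):
    """Return only the files whose filter cannot rule the key out."""
    return [name for name, bloom in filters_by_file.items() if bloom.might_contain(key)]

# Two made-up on-disk files, each with its own filter.
first, second = BloomFilter(), BloomFilter()
first.add("row-1")
candidates = files_to_read("row-1", {"file-1": first, "file-2": second})
```

False positives are possible (a file may be read and turn out not to hold the key), false negatives are not.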

HDFS Deep Dive

  • http://berlinbuzzwords.de/content/hdfs-deep-dive
  • one name node, multiple data nodes
  • use 0.20.2-append branch
  • NameNode SPOF in theory, but not in practice

Cassandra Talk – Eric Evans

  • http://www.slideshare.net/jericevans/cassandra-explained
  • Vector clocks => http://en.wikipedia.org/wiki/Vector_clock

Massively Parallel Analytics beyond Map/Reduce

  • Context: StratoSphere Project

Collaborative Filtering Mahout – Frank Scholten

  • is a machine learning library for Java
  • runs on Hadoop
  • Similarity Algorithms
  • TanimotoCoefficientSimilarity
  • LogLikelihoodSimilarity
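
For reference, the Tanimoto coefficient behind the first similarity above is just intersection size divided by union size. The following is a plain Python sketch of the formula only; it is not Mahout's API, and the function name and the sample item sets are made up.

```python
def tanimoto_coefficient(items_a, items_b):
    """Size of the intersection divided by size of the union of two item sets."""
    a, b = set(items_a), set(items_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen for this sketch when both sets are empty
    return len(a & b) / len(union)

# Two hypothetical users who share 2 of 4 distinct items.
similarity = tanimoto_coefficient({"x", "y", "z"}, {"y", "z", "w"})  # 0.5
```

It ignores preference values entirely and only looks at which items two users have in common.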

Sqoop Database import/export for Hadoop

  • I wish every speaker had structured their talks as logically as Aaron Kimball did with his Hadoop talks

Hive

  • parser/optimizer/compiler that translates HiveQL into MapReduce code
  • metastore that stores “schema” information (e.g. table name, column names, data types)
  • schema on read, not write
  • definitely take a look!!!
  • WHERE => map
  • GROUP BY/ORDER BY => reduce
  • JOIN => map or reduce depending on optimizer
  • there is a HiveQL EXPLAIN command
  • learn Python
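
The WHERE => map and GROUP BY => reduce mapping above can be simulated in plain Python. This is a sketch of the idea, not what Hive actually generates. It mimics a hypothetical query `SELECT page, COUNT(*) FROM visits WHERE status = 200 GROUP BY page`; the table, the columns and the function names are all made up.

```python
from collections import defaultdict

def map_phase(rows):
    """WHERE runs in the mapper: drop non-matching rows, emit (group key, 1)."""
    for row in rows:
        if row["status"] == 200:
            yield row["page"], 1

def shuffle(pairs):
    """Group mapper output by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """GROUP BY with COUNT(*) runs in the reducer: aggregate the values per key."""
    return {key: sum(values) for key, values in grouped.items()}

visits = [
    {"page": "/a", "status": 200},
    {"page": "/b", "status": 200},
    {"page": "/a", "status": 404},
    {"page": "/a", "status": 200},
]
counts = reduce_phase(shuffle(map_phase(visits)))
```

Here `counts` ends up as one entry per page with the number of matching rows, which is what the GROUP BY would return.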