BerlinBuzzwords
http://berlinbuzzwords.de
Keynote – Grant Ingersoll – Lucid Imagination about Lucene
- @gsingers
- grant@lucidimagination.com
- http://www.manning.com/ingersoll
- SolrCloud == ZooKeeper + Solr
- http://wiki.apache.org/solr/SolrCloud
- http://en.wikipedia.org/wiki/Sentiment_analysis
- Map/Reduce Ready recommenders available
- Identify Topics
- Latent Dirichlet Allocation ==> http://de.wikipedia.org/wiki/Latent_Dirichlet_Allocation
- Frequent Pattern Mining
- Clustering
- K-Means, Dirichlet, Canopy, etc
- Carrot^2 Document and Search Result Clustering => http://project.carrot2.org/
- Eigen Cuts (spectral clustering) => http://www.google.de/search?hl=de&q=eigen+cuts+clustering&aq=f&aqi=&aql=&oq=&gs_rfai=
- http://cwiki.apache.org/MAHOUT/algorithms.html
- Location Aware Search results
- Query Parsing
- Filtering
- Boosting
- Sorting
- Singular Value Decomposition (SVD) => http://en.wikipedia.org/wiki/Singular_value_decomposition
- technique for reducing the dimensionality of large matrices while retaining the core features of the larger space
- Latent Semantic Analysis uses SVD to provide search over the reduced space => http://github.com/algoriffic/lsa4solr
- Named Entity Recognition => http://en.wikipedia.org/wiki/Named_entity_recognition
- Finite-State Queries in Lucene http://lingpipe-blog.com/2010/03/25/finite-state-queries-in-lucene/
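The SVD bullet above describes keeping the core features of a large matrix in a smaller space. A minimal sketch of that idea, assuming a tiny dense matrix and using plain power iteration to find only the largest singular value and its vectors (the function names are made up for illustration; this is not how Mahout or lsa4solr compute it):

```python
import math


def top_singular_triplet(a, iterations=100):
    """Approximate the largest singular value of `a` and its singular vectors.

    Uses power iteration on A^T A. Assumes the start vector (all ones,
    normalised) is not orthogonal to the dominant right singular vector.
    """
    rows, cols = len(a), len(a[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iterations):
        # av = A v
        av = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        # w = A^T (A v)
        w = [sum(a[i][j] * av[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    av = [sum(a[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = math.sqrt(sum(x * x for x in av))
    u = [x / sigma for x in av] if sigma else av
    return u, sigma, v


def rank_one_approximation(a):
    """Rebuild `a` from its single strongest component: sigma * u * v^T."""
    u, sigma, v = top_singular_triplet(a)
    return [[sigma * ui * vj for vj in v] for ui in u]


# Every row is a multiple of (1, 2), so one component describes the matrix.
matrix = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
print(rank_one_approximation(matrix))
```

Latent Semantic Analysis keeps more than one component, but the principle is the same: search runs against the few strongest components instead of the full term/document matrix.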
Lucene Forecast – Version, Unicode, Flex, Modules by Simon Willnauer
Making Software for Humans: CouchDB
- Jan Lehnardt
- a one-size-fits-all solution for scaling out is not really possible at all @janl #berlinbuzzwords
Text and Metadata extraction with Apache Tika
- http://tika.apache.org/
- switched to Riak Talk
- http://berlinbuzzwords.de/content/metacarta-geosearch-toolkit-solr
LEARNING LESSONS: BUILDING A CMS ON TOP OF NOSQL TECHNOLOGIES
- http://berlinbuzzwords.de/content/learning-lessons-building-cms-top-nosql-technologies
- best talk so far
Elastic Search
- http://berlinbuzzwords.de/content/elasticsearch-you-know-search
- distributed
- completely HTTP based
- range queries possible
- JSON based
- Lucene based
- filters: faster than queries (cacheable)
- Near Realtime Search available
Neo4j – Peter Neubauer
- Nodes
- Relationships between nodes
- properties on nodes and relationships
- Traversal Framework
- Lucene integrated (indexing done on commit)
- http://www.google.de/search?hl=de&q=rdf+reasoning&aq=f&aqi=&aql=&oq=&gs_rfai=
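The data model in the bullets above (nodes, relationships, properties, traversal) can be sketched in a few lines. This is a toy in-memory illustration, not the Neo4j API; the class `Graph` and its methods are invented for this example:

```python
from collections import deque


class Graph:
    """Toy property graph: nodes and typed relationships, both with properties."""

    def __init__(self):
        self.nodes = {}          # node id -> properties dict
        self.relationships = []  # (start id, type, end id, properties dict)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def relate(self, start, rel_type, end, **props):
        self.relationships.append((start, rel_type, end, props))

    def traverse(self, start, rel_type, max_depth):
        """Breadth-first traversal over outgoing `rel_type` relationships.

        Returns the ids of nodes reachable within `max_depth` hops,
        in the order they were discovered.
        """
        seen = {start}
        found = []
        queue = deque([(start, 0)])
        while queue:
            node, depth = queue.popleft()
            if depth == max_depth:
                continue
            for s, t, e, _props in self.relationships:
                if s == node and t == rel_type and e not in seen:
                    seen.add(e)
                    found.append(e)
                    queue.append((e, depth + 1))
        return found


g = Graph()
for name in ("alice", "bob", "carol", "dave"):
    g.add_node(name, name=name)
g.relate("alice", "KNOWS", "bob", since=2008)
g.relate("bob", "KNOWS", "carol", since=2009)
g.relate("alice", "WORKS_WITH", "dave")
print(g.traverse("alice", "KNOWS", max_depth=2))  # friends and friends of friends
```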
Riak Search
- http://berlinbuzzwords.de/content/basho-search
- different query type patterns – key-based
- Consistent Hashing and Partitions
- Optimizations (to avoid the “Obama problem”: hotspots in the ring, e.g. many documents containing the word Obama all going to the same node)
- Bloom Filters & Caching
- Batching to save query-time & index-time bandwidth
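To make "Consistent Hashing and Partitions" from the list above concrete, here is a minimal hash ring sketch. It is a generic illustration, not Riak's implementation; the class name `HashRing`, the `vnodes` parameter and the node names are assumptions for this example:

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        # Each physical node owns `vnodes` points on the ring.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._positions = [pos for pos, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        """Return the node owning the first ring point after the key's hash."""
        idx = bisect.bisect_right(self._positions, self._hash(key))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("obama"))  # every lookup of the same key lands on one node
```

Because a key always maps to the same point on the ring, a very frequent term ends up on a single node. That is the hotspot the "Obama problem" bullet refers to, and the reason the talk listed extra optimizations on top of plain partitioning.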
Day2
Keynote
- ZeroMQ – 0MQ – http://www.zeromq.org/
- http://www.zeromq.org/blog:multithreading-magic
- “the fewest possible moving pieces” -> no service discovery.
- No Broker
Hypertable – http://berlinbuzzwords.de/content/hypertable-ultimate-scaling-machine
- used by Baidu search engine
- Dynamo – used by Amazon for Shopping Cart (uses Read-Repair and Consistent Hashing)
- during periods of failure there can be latency spikes, because a single machine has to handle the ranges of the failed machines
- LSM – Log Structured Merge Tree
- eliminates random I/O => holding a log tree structure in memory which will be written to disk asynchronously (compaction)
- uses Bloom Filters => helps to avoid disk seeks by running every key through a Bloom filter to determine which file the key/value pair is in
- Dynamic Memory Adjustment based on Workload
- Hypertable vs. HBase
- 70% faster than HBase (seq. write / seq. read)
- one of the reasons: Dynamic Memory Adjustment
- You can run Map/Reduce (Hadoop) and it is Data Locality Aware
- Query types: timestamp ranges, versions of the cell possible, but mainly primary key access
- No delete operation currently (but each column family has a TTL)
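The Bloom filter bullet above says every key is run through a filter to decide which files are worth a disk seek. A rough sketch of that idea, assuming one small filter per on-disk file (class name, sizes and file names are invented here and say nothing about Hypertable's actual format):

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def files_to_check(filters, key):
    """Return the files whose filter says the key might be present."""
    return [name for name, bloom in filters.items() if bloom.might_contain(key)]


# One filter per (hypothetical) on-disk file.
filters = {"file-1": BloomFilter(), "file-2": BloomFilter()}
filters["file-1"].add("row-17")
filters["file-2"].add("row-42")
print(files_to_check(filters, "row-42"))
```

A file that is not in the returned list definitely does not hold the key, so its disk seek can be skipped. A file that is in the list still has to be read, because the filter can produce false positives.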
HDFS Deep Dive
- http://berlinbuzzwords.de/content/hdfs-deep-dive
- one name node, multiple data nodes
- use 0.20.2-append branch
- NameNode SPOF in theory, but not in practice
Cassandra Talk – Eric Evans
- http://www.slideshare.net/jericevans/cassandra-explained
- Vectorclocks => http://en.wikipedia.org/wiki/Vector_clock
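The notes only link to vector clocks, so here is a small generic sketch of how they order events. It illustrates the concept from the linked article and is not a statement about how Cassandra itself resolves conflicts; the function names are chosen for this example:

```python
def increment(clock, node):
    """Return a copy of `clock` with `node`'s counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def merge(a, b):
    """Combine two clocks by taking the per-node maximum."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


def compare(a, b):
    """Classify clock `a` relative to `b`: before, after, equal or concurrent."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


v1 = increment({}, "n1")   # first write, seen by node n1
v2 = increment(v1, "n2")   # n2 updates on top of v1
v3 = increment(v1, "n3")   # n3 updates v1 without having seen v2
print(compare(v1, v2))     # before
print(compare(v2, v3))     # concurrent
```

Two versions that compare as "concurrent" are the conflicts a store has to reconcile, for example by merging both values and the two clocks.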
Massively Parallel Analytics beyond Map/Reduce
- Context: StratoSphere Project
Collaborative Filtering Mahout – Frank Scholten
- is a machine learning lib for Java
- runs on Hadoop
- Similarity Algorithms
- TanimotoCoefficientSimilarity
- LogLikelihoodSimilarity
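The Tanimoto coefficient named above is simply the size of the overlap between two sets divided by the size of their union. A minimal Python sketch of the formula follows; Mahout's own class is Java and works on its data model, so the function and the sample users here are purely illustrative:

```python
def tanimoto(items_a, items_b):
    """Tanimoto (Jaccard) coefficient: |A ∩ B| / |A ∪ B|, 0.0 for two empty sets."""
    a, b = set(items_a), set(items_b)
    union = len(a | b)
    if union == 0:
        return 0.0
    return len(a & b) / union


# Items each (made-up) user has interacted with.
likes = {
    "user1": {"lucene", "solr", "mahout"},
    "user2": {"solr", "mahout", "hadoop"},
    "user3": {"couchdb"},
}
print(tanimoto(likes["user1"], likes["user2"]))  # 2 shared / 4 distinct = 0.5
print(tanimoto(likes["user1"], likes["user3"]))  # nothing shared = 0.0
```

The measure ignores rating values and only looks at which items two users have in common, which suits data without explicit ratings.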
Sqoop Database import/export for Hadoop
- wish every speaker had structured their talks as logically as Aaron Kimball did his Hadoop talks
Hive
- parser/optimizer/compiler that translates HiveQL into MapReduce code
- metastore that stores “schema” information (e.g. table name, column names, data types)
- schema on read, not write
- definitely take a look!!!
- WHERE => map
- GROUP BY/ORDER BY => reduce
- JOIN => map or reduce depending on optimizer
- there is a HiveQL EXPLAIN command
- learn Python
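The mapping from the Hive notes above (WHERE => map, GROUP BY => reduce) can be imitated in plain Python. The sketch assumes a made-up query, `SELECT dept, COUNT(*) FROM employees WHERE salary > 50000 GROUP BY dept`, and made-up rows; it shows the idea only and is not the code Hive generates:

```python
from collections import defaultdict


def map_phase(rows):
    """WHERE runs in the mapper: drop rows early, emit (group key, 1)."""
    for row in rows:
        if row["salary"] > 50000:
            yield row["dept"], 1


def shuffle(pairs):
    """Group mapper output by key, as the framework does between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """GROUP BY with COUNT(*) runs in the reducer: one result per key."""
    return {key: sum(values) for key, values in groups.items()}


employees = [
    {"dept": "eng", "salary": 60000},
    {"dept": "eng", "salary": 40000},
    {"dept": "ops", "salary": 70000},
    {"dept": "eng", "salary": 90000},
]
print(reduce_phase(shuffle(map_phase(employees))))  # {'eng': 2, 'ops': 1}
```

In Hive itself, prefixing a query with EXPLAIN shows the plan the optimizer actually chose.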