Posts Tagged ‘nosql’

11 links on scalable software architecture and big data

August 19th, 2010 | By Christoph in Software-Development | No Comments »

From time to time I go through my bookmarks and pick out interesting links I came across during the last couple of months. This article is more or less Part 2 of a past article with links and videos about scalability and software architecture.

1. Learning from Five Years as a Skype Architect – Andres Kutt
2. Scale at Facebook
3. Murder: Fast datacenter code deploys using BitTorrent
4. What every programmer should know about memory, Part 1
5. Multithreading Magic
6. NoSQL: The Definitive Guide
7. Getting Good IO from Amazon’s EBS
8. Building a terabyte-scale data cycle at LinkedIn with Hadoop and Project Voldemort
9. Scalability of the Hadoop Distributed File System
10. Understanding Cassandra Code Base
11. Big Data in Real-Time at Twitter

My Notes taken at Berlin Buzzwords

June 30th, 2010 | By Christoph in Software-Development | No Comments »

BerlinBuzzwords

http://berlinbuzzwords.de

Keynote – Grant Ingersoll – Lucid Imagination about Lucene

  • @gsingers
  • grant@lucidimagination.com
  • http://www.manning.com/ingersoll
  • SolrCloud == ZooKeeper + Solr
  • http://wiki.apache.org/solr/SolrCloud
  • http://en.wikipedia.org/wiki/Sentiment_analysis
  • Map/Reduce Ready recommenders available
  • Identify Topics
  • Latent Dirichlet Allocation ==> http://de.wikipedia.org/wiki/Latent_Dirichlet_Allocation
  • Frequent Pattern Mining
  • Clustering
  • K-Means, Dirichlet, Canopy, etc
  • Carrot^2 Document and Search Result Clustering => http://project.carrot2.org/
  • Eigen Cuts (spectral clustering) => http://www.google.de/search?hl=de&q=eigen+cuts+clustering&aq=f&aqi=&aql=&oq=&gs_rfai=
  • http://cwiki.apache.org/MAHOUT/algorithms.html
  • Location Aware Search results
  • Query Parsing
  • Filtering
  • Boosting
  • Sorting
  • Singular Value Decomposition (SVD) => http://en.wikipedia.org/wiki/Singular_value_decomposition
  • technique for reducing the dimensionality of large matrices while retaining the core features of the larger space
  • Latent Semantic Analysis uses SVD to provide search over the reduced space => http://github.com/algoriffic/lsa4solr
  • Named Entity Recognition => http://en.wikipedia.org/wiki/Named_entity_recognition
  • Finite-State Queries in Lucene http://lingpipe-blog.com/2010/03/25/finite-state-queries-in-lucene/

Lucene Forecast – Version, Unicode, Flex, Modules by Simon Willnauer

  • switched to Kino 10

Making Software for Humans: CouchDB

  • Jan Lehnardt
  • a one-size-fits-all solution for scaling out is not really possible at all @janl #berlinbuzzwords

Text and Metadata extraction with Apache Tika

  • http://tika.apache.org/
  • switched to Riak Talk

MetaCarta GeoSearch Toolkit for Solr

  • http://berlinbuzzwords.de/content/metacarta-geosearch-toolkit-solr

Learning Lessons: Building a CMS on Top of NoSQL Technologies

  • http://berlinbuzzwords.de/content/learning-lessons-building-cms-top-nosql-technologies
  • best talk so far

Elastic Search

  • http://berlinbuzzwords.de/content/elasticsearch-you-know-search
  • distributed
  • completely HTTP based
  • range queries possible
  • JSON based
  • Lucene based
  • filters: faster than queries (cacheable)
  • Near Realtime Search available

Neo4j – Peter Neubauer

  • Nodes
  • Relationships between nodes
  • properties on nodes and relationships
  • Traversal Framework
  • Lucene integrated (indexing done on commit)
  • http://www.google.de/search?hl=de&q=rdf+reasoning&aq=f&aqi=&aql=&oq=&gs_rfai=

Riak Search

  • http://berlinbuzzwords.de/content/basho-search
  • different query type patterns – keybased
  • Consistent Hashing and Partitions
  • Optimizations to avoid the “Obama problem”: hotspots in the ring, e.g. many documents containing the word “Obama” all going to the same node
  • Bloom Filters & Caching
  • Batching to save query-time & index-time bandwidth
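The consistent hashing idea from these notes can be sketched in a few lines of Python. This is a generic toy ring, not Riak Search's actual implementation: the node names, the number of virtual nodes, and the trick of putting a document id into the key to spread a hot term are all my own assumptions for illustration.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a position on the ring (an integer derived from SHA-256)."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring; each node owns several partitions (virtual nodes)."""

    def __init__(self, nodes, vnodes=64):
        # Every node is placed on the ring `vnodes` times to even out the load.
        self._ring = sorted(
            (ring_position(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._positions = [pos for pos, _ in self._ring]

    def node_for(self, key: str) -> str:
        """Return the node owning the first partition clockwise from the key."""
        idx = bisect.bisect_right(self._positions, ring_position(key))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["node-a", "node-b", "node-c"])  # hypothetical node names

# Keyed by term only: every posting for "obama" lands on one node (a hotspot).
owner_by_term = ring.node_for("obama")

# Keyed by term plus document id (an assumed mitigation): postings spread out.
owners_by_term_and_doc = {ring.node_for(f"obama/doc-{i}") for i in range(200)}
```

A nice property of the ring: when a node is removed, only the keys that lived on that node move; everything else stays where it was.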

Day 2

Keynote

  • ZeroMQ – 0MQ – http://www.zeromq.org/
  • http://www.zeromq.org/blog:multithreading-magic
  • “the fewest possible moving pieces” -> no service discovery.
  • No Broker

Hypertable – http://berlinbuzzwords.de/content/hypertable-ultimate-scaling-machine

  • used by Baidu search engine
  • Dynamo – used by Amazon for Shopping Cart (uses Read-Repair and Consistent Hashing)
  • during periods of failure there can be latency spikes, because of single machine handling the ranges / machines
  • LSM – Log Structured Merge Tree
  • eliminates random I/O => holds a log tree structure in memory which is written to disk asynchronously (compaction)
  • uses Bloom Filters => helps to avoid disk seeks by running every key through a Bloom filter to determine which file the k/v is in
  • Dynamic Memory Adjustment based on Workload
  • Hypertable vs. HBase
  • 70% faster than HBase (seq. write / seq. read)
  • one of the reasons: Dynamic Memory Adjustment
  • You can run Map/Reduce (Hadoop) and it is Data Locality Aware
  • Query types: timestamp ranges, versions of the cell possible, but mainly primary key access
  • No delete operation currently (but each column family has a TTL)
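To make the Bloom filter point concrete, here is a minimal sketch of how a per-file filter lets a lookup skip files that cannot contain a key. It is not Hypertable's code; the file names, filter size, and hash scheme are assumptions picked for readability. Note that a Bloom filter can only say "definitely not here" or "maybe here", so a match still has to be confirmed by reading the file.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key: str):
        # Derive several bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))


# Hypothetical on-disk files and the row keys stored in each of them.
files = {
    "cellstore-1": ["row-001", "row-002"],
    "cellstore-2": ["row-100"],
}

# Build one filter per file and keep the filters in memory.
filters = {}
for name, keys in files.items():
    bloom = BloomFilter()
    for key in keys:
        bloom.add(key)
    filters[name] = bloom


def candidate_files(key: str):
    """Return only the files worth a disk seek for this key."""
    return [name for name, bloom in filters.items() if bloom.might_contain(key)]
```

Calling `candidate_files("row-100")` always includes `cellstore-2`; files whose filter rejects the key are skipped without touching the disk.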

HDFS Deep Dive

  • http://berlinbuzzwords.de/content/hdfs-deep-dive
  • one name node, multiple data nodes
  • use 0.20.2-append branch
  • NameNode is a SPOF in theory, but not in practice

Cassandra Talk – Eric Evans

  • http://www.slideshare.net/jericevans/cassandra-explained
  • Vectorclocks => http://en.wikipedia.org/wiki/Vector_clock
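For my own reference, a rough sketch of what comparing two vector clocks looks like. This is the textbook idea from the Wikipedia article linked above, not code from Cassandra or from the talk; the replica names are made up.

```python
def compare(a: dict, b: dict) -> str:
    """Compare two vector clocks given as {node_name: counter} dicts.

    Returns "equal", "before" (a happened before b), "after" (b happened
    before a), or "concurrent" (neither dominates: a conflict to resolve).
    """
    nodes = set(a) | set(b)
    # A missing entry counts as 0 for that node.
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


# Two replicas (hypothetical names) updated the same value independently.
conflict = compare({"replica-1": 2, "replica-2": 0}, {"replica-1": 1, "replica-2": 1})
```

Here `conflict` is `"concurrent"`, because each clock is ahead of the other on one replica.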

Massively Parallel Analytics beyond Map/Reduce

  • Context: StratoSphere Project

Collaborative Filtering Mahout – Frank Scholten

  • is a machine learning lib for java
  • run on hadoop
  • Similarity Algorithms
  • TanimotoCoefficientSimilarity
  • LogLikelihoodSimilarity
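The Tanimoto coefficient itself is simple: the size of the intersection of two item sets divided by the size of their union. Below is a plain-Python sketch of just the formula; it does not use Mahout, and the users and item ids are invented. Returning 0.0 for two empty sets is my own choice here.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) coefficient of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 0.0  # assumption: treat two empty profiles as not similar
    return len(a & b) / len(a | b)


# Hypothetical users and the items they liked.
alice = {"book-1", "book-2", "book-3"}
bob = {"book-2", "book-3", "book-4"}

similarity = tanimoto(alice, bob)  # 2 shared items out of 4 distinct ones
```

With these two sets, `similarity` is 0.5. The measure ignores rating values and only looks at which items overlap.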

Sqoop Database import/export for hadoop

  • I wish every speaker had structured their talks as logically as Aaron Kimball did with his Hadoop talks

Hive

  • parser/optimizer/compiler that translates HiveQL into MapReduce code
  • metastore that stores “schema” information (e.g. table name, column names, data types)
  • schema on read, not write
  • definitely take a look at this!!!
  • WHERE => map
  • GROUP BY/ORDER BY => reduce
  • JOIN => map or reduce depending on optimizer
  • there is a HiveQL EXPLAIN command
  • learn Python
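To picture the WHERE => map and GROUP BY => reduce mapping, here is a tiny in-memory simulation in Python. It is only an illustration of the idea, not what Hive generates: the table, the columns, and the query `SELECT dept, COUNT(*) FROM employees WHERE salary > 50000 GROUP BY dept` are made up for this sketch.

```python
from collections import defaultdict

# Hypothetical "employees" table.
rows = [
    {"dept": "search", "salary": 70000},
    {"dept": "search", "salary": 40000},
    {"dept": "storage", "salary": 65000},
    {"dept": "search", "salary": 90000},
]


def map_phase(row):
    """WHERE salary > 50000: filter in the mapper, emit (group key, 1)."""
    if row["salary"] > 50000:
        yield row["dept"], 1


def shuffle(pairs):
    """Group the emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """GROUP BY dept with COUNT(*): sum the ones for each key."""
    return key, sum(values)


pairs = [pair for row in rows for pair in map_phase(row)]
result = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

For the sample rows, `result` is `{"search": 2, "storage": 1}`: the filter runs row by row in the map step, and the counting only happens once all values for a key have been brought together.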

Cassandra Distributed Database – a link list for beginners

April 12th, 2010 | By Christoph in Software-Development | 2 Comments »

I am currently playing around with Apache Cassandra (distributed database) and this article is to store all the links I have used before I close my browser with trillions of open tabs :)

http://wiki.apache.org/cassandra/GettingStarted
http://wiki.apache.org/cassandra/ClientOptions
http://prettyprint.me/2010/02/23/hector-a-java-cassandra-client/

http://wiki.apache.org/cassandra/DataModel

Installing and using Apache Cassandra with Java, in 5 parts (very good article series!!!)
http://www.sodeso.nl/?p=80
http://www.sodeso.nl/?p=108
http://www.sodeso.nl/?p=207
http://www.sodeso.nl/?p=251
http://www.sodeso.nl/?p=354
Update 2010/04/13: Ronald Mathies commented that he has two new articles. Thanks!
About importing / exporting data from a Cassandra Database: http://www.sodeso.nl/?p=448
Creating custom sorting types for Cassandra: http://www.sodeso.nl/?p=421

WTF is a SuperColumn? An Intro to the Cassandra Data Model

http://stackoverflow.com/questions/1502735/whats-the-best-practice-in-designing-a-cassandra-data-model

https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/

Building a small Cassandra Cluster for Development and Testing

Cassandra: RandomPartitioner vs OrderPreservingPartitioner

http://about.digg.com/blog/looking-future-cassandra

http://blog.evanweaver.com/articles/2009/07/06/up-and-running-with-cassandra/

http://emmanuelpozo.com/post/317479418/cassandra-messaging-1

Time Series-Data Model

If anybody has more links to examples and best practices for Cassandra data models, please comment. Thanks.