I would love to see some results of benchmarking.
I am currently looking at the advantages of node.js: http://nodejs.org/about/. They refer to an advanced-level talk by fefe explaining their concept of non-blocking I/O. There is a nice beginner-level article explaining the concept of event-driven and non-blocking I/O, but I really wonder how expensive it is to set up a callback and wait for it. Also, this article focuses very much on Comet, which is nice to have but not the reason to choose a particular technology.
Upshot: Right now my feeling is that the metalcon like button, which creates a ton of HTTP requests per second, could probably be built on top of node.js. I was also thinking at first (but am not sure anymore) that node.js could be a really nice entry point for our application architecture, since it can quickly distribute our requests and pass them forward to various services. There is also a list of node modules which helps to see what node can do.
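To get a rough feeling for the callback question, here is a minimal sketch. It uses Python's asyncio as a stand-in for the node.js event loop (an assumption made for illustration, so the numbers say nothing about node.js itself). The function names and the 10 ms sleep that simulates an I/O wait are hypothetical.

```python
import asyncio
import time


async def measure_callback_overhead(n: int = 10_000) -> float:
    """Schedule n no-op callbacks on the event loop; return seconds per callback."""
    loop = asyncio.get_running_loop()
    done = loop.create_future()
    remaining = n

    def on_event() -> None:
        # Each callback only counts down; the last one resolves the future.
        nonlocal remaining
        remaining -= 1
        if remaining == 0:
            done.set_result(None)

    start = time.perf_counter()
    for _ in range(n):
        loop.call_soon(on_event)
    await done
    return (time.perf_counter() - start) / n


async def handle_request(request_id: int) -> int:
    # asyncio.sleep stands in for a non-blocking I/O wait (socket, database, ...).
    await asyncio.sleep(0.01)
    return request_id


async def main() -> tuple[float, list[int]]:
    per_callback = await measure_callback_overhead()
    # 100 simulated requests wait concurrently on a single thread.
    results = await asyncio.gather(*(handle_request(i) for i in range(100)))
    return per_callback, results


if __name__ == "__main__":
    per_callback, results = asyncio.run(main())
    print(f"{per_callback * 1e6:.2f} microseconds per callback, "
          f"{len(results)} requests served")
```

This only measures the cost of scheduling and running a callback inside one process. A real benchmark would still have to include the network stack and realistic request handlers.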
EventMachine: An event-driven I/O framework for Ruby (and JRuby).
It provides clients for
- Socks and numerous other protocols.
It supports epoll (Linux), kqueue (BSD/OS X) and /dev/poll (Solaris)
- between features and technologies. Most in contrast to Riak.
- This article nicely describes scaling issues with MySQL and also a pretty nice hack around them: http://backchannel.org/blog/friendfeed-schemaless-mysql
- does not really scale beyond one machine. So if the graph grows beyond a certain size, it is game over for us.
- good use case for recommender systems especially real time recommender
- pay attention to super nodes (which a social network certainly has)
- basically have to develop with the embedded Java graph database since other drivers are too slow
- really perfect if one wants to keep track of the social network of a user.
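The super node warning can be made concrete with a small degree count. This is a plain Python sketch over an edge list, not code against any graph database API. The edge data and the `min_degree` threshold are made up for illustration.

```python
from collections import Counter


def find_super_nodes(edges, min_degree):
    """Return {node: degree} for nodes with at least min_degree incident edges."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return {node: count for node, count in degree.items() if count >= min_degree}


# Hypothetical "like" edges: a thousand users all point at the same band.
edges = [(f"user{i}", "metallica") for i in range(1000)] + [("user1", "user2")]
super_nodes = find_super_nodes(edges, min_degree=100)
print(super_nodes)
```

Any traversal that passes through such a node has to touch all of its edges, which is why these nodes deserve special handling in a recommender.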
- Highly distributed database built upon JSON-style documents
- Auto-sharding for seamless horizontal scalability
- Sharding may be useful for write-intensive IO on the database-infrastructure. Otherwise it is not 100% necessary
- Features GridFS which can be used as module for nginx
- Enables fragmentation of BSON documents into chunks of 256 KB. The file size of BSON documents is limited to 16 MB without GridFS.
- Replication (Master/Slave)
- Replica sets for redundancy
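The chunking arithmetic behind the GridFS bullet can be sketched in a few lines. This is plain Python and not the GridFS driver API. The constants simply restate the 256 KB and 16 MB figures from the list above, and the helper names are hypothetical.

```python
import math

CHUNK_SIZE = 256 * 1024                # chunk size quoted above (256 KB)
MAX_BSON_DOCUMENT = 16 * 1024 * 1024   # single-document limit quoted above (16 MB)


def needs_gridfs(file_size: int) -> bool:
    """A file larger than the document limit cannot be stored as one document."""
    return file_size > MAX_BSON_DOCUMENT


def chunk_count(file_size: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Number of chunks needed to hold file_size bytes."""
    return math.ceil(file_size / chunk_size)


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Cut data into consecutive chunks; only the last one may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


twenty_mb = 20 * 1024 * 1024
print(needs_gridfs(twenty_mb), chunk_count(twenty_mb))
```

Under these assumptions a 20 MB file exceeds the document limit and is split into 80 chunks.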
- http://cassandra.apache.org/ (need to dig deeper)
- nice benchmark by Netflix: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html maybe a little too far off.
- Highly distributed database with internal Key-Value store
- built-in Map-Reduce
- support for PHP, Ruby, Python, Node.js, Java ...
- consistent hashing
- What is Riak?
- Slides for brief introduction to Riak
- Riak in a production environment
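For readers who have not met consistent hashing yet, here is a generic hash ring sketch. It is not Riak's actual implementation. The class name, the number of virtual nodes and the key names are assumptions chosen for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Generic consistent hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=64):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (hash, node) points on the ring
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        # Every physical node is placed on the ring at several points.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def node_for(self, key: str) -> str:
        """Return the node owning the first ring point at or after the key's hash."""
        if not self._ring:
            raise ValueError("ring is empty")
        index = bisect.bisect_right(self._ring, (self._hash(key), ""))
        return self._ring[index % len(self._ring)][1]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {key: ring.node_for(key) for key in keys}
ring.add_node("node-d")
after = {key: ring.node_for(key) for key in keys}
moved = [key for key in keys if before[key] != after[key]]
print(f"{len(moved)} of {len(keys)} keys moved to the new node")
```

The point of the technique is visible in the last lines: adding a machine only moves the keys that the new machine takes over, while all other keys stay where they were.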
caching / indices
- easy to scale horizontally just add more machines.
- hard to think of persistence with the database layer.
- needs a certain caching layer (could be integrated on various layers)
- Key/Value Store
- performs well with a massive number of requests (also small ones) that rely on volatile data
- Small proof of concept of building an autocompletion "service" with redis
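The idea behind such an autocompletion service can be sketched without a Redis server. The class below keeps terms in sorted order and answers prefix queries with a binary search, which mirrors what a lexicographically ordered sorted set would do. It is an in-memory illustration and not the linked proof of concept. The class name and sample terms are hypothetical.

```python
import bisect


class PrefixIndex:
    """Sorted list of terms that supports prefix lookups (illustrative only)."""

    def __init__(self):
        self._terms = []

    def add(self, term: str) -> None:
        term = term.lower()
        index = bisect.bisect_left(self._terms, term)
        # Insert only if the term is not already present.
        if index == len(self._terms) or self._terms[index] != term:
            self._terms.insert(index, term)

    def suggest(self, prefix: str, limit: int = 5) -> list[str]:
        prefix = prefix.lower()
        # All terms starting with the prefix are adjacent in sorted order.
        index = bisect.bisect_left(self._terms, prefix)
        results = []
        while (index < len(self._terms)
               and len(results) < limit
               and self._terms[index].startswith(prefix)):
            results.append(self._terms[index])
            index += 1
        return results


index = PrefixIndex()
for name in ["Metallica", "Megadeth", "Meshuggah", "Slayer", "Metal Church"]:
    index.add(name)
print(index.suggest("met"))
```

With Redis the sorted list would live in the store instead of in the application process, so several web servers could share one suggestion index.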
Benchmark: For the nerds http://redis.io/topics/benchmarks
The main question will be: do we have technologies (like cassandra or our own services) where we don't need caching or where we can cache without network requests?
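Caching without network requests means keeping hot values in the memory of the process that serves the request. A minimal sketch with Python's `functools.lru_cache` follows. The backend function, the counter and the cache size are hypothetical stand-ins for a real service call.

```python
from functools import lru_cache

backend_calls = 0  # counts how often the (simulated) remote service is hit


def load_display_name(user_id: int) -> str:
    """Hypothetical stand-in for a network request to a backend service."""
    global backend_calls
    backend_calls += 1
    return f"user-{user_id}"


@lru_cache(maxsize=1024)
def get_display_name(user_id: int) -> str:
    # The first call per user_id goes to the backend; repeats come from local memory.
    return load_display_name(user_id)


first = get_display_name(42)
second = get_display_name(42)
print(first, backend_calls, get_display_name.cache_info().hits)
```

The trade-off is that every process holds its own copy, so invalidation becomes the hard part once several machines cache the same data.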
solr / Lucene
I am pretty sure that Apache Solr will be used to power our search infrastructure. Reading this Quora discussion, one can see that there seems to be only one reasonable alternative, which is Elasticsearch. From my understanding Solr is more mature, but it would be nice to find some solid benchmarks (which also relate to our use case):
- How many documents will we index?
- how many concepts / queries for auto suggest?
- how many GB of data?
- How many search queries do we expect per second?
- How will we integrate personalized rankings?
Since we also want to index pages from external sites, we should be able to parse HTML. There is Apache Nutch, which builds on Solr and provides crawling, a link database and HTML parsing. There exists an introduction to Nutch.
An example use of solr can be found here:
- source code: https://github.com/opencitations/OpenCitationsCorpus/tree/master/Vis/facetview
- demo: http://okfnlabs.org/facetview/
web frame works / programming languages
ruby on rails
Checking out the ruby on rails site you quickly come to the screencast section. I directly found a pretty good series on scaling rails. Most of the implemented techniques will work with other frameworks too but it was really nice to see how easily they could be integrated in a rails application. Also it was good to see some talks about tools one can use for load testing server applications.
You really want to get started here
- improved turbolinks (responsible for updating webpages partially)
gwtp on gwt
- good since all is java
- bad since result is an entire application
- bad since rather big technology stack
- not clear how long google will maintain the project
PHP / hip hop
- PHP lacks performance
- no experience with hip hop
Documentation: http://symfony.com/doc/current/index.html
- Introduction to Symfony
python / django / giotto
- May be useful to establish low-latency connections and real-time capabilities with remote resources.
- Avoids HTTP overhead and long polling
- Uses a single TCP connection
- Website stays responsive while a script is running
Thrift seems like a very useful tool to create cross-language RPC modules. Need to dig deeper into this.
Asynchronous network I/O library for Python: http://www.tornadoweb.org/en/stable/ Maybe something like that exists for other languages too.
Based on this HTML5/CSS framework, we will be able to do something awesome. Good starting point!!
- easy to extend
- various forks (see github)
- very good semantic structure
Visual tools and snippets:
Master's thesis on how to build a scalable web crawler. IMHO a good introduction to this domain and recommended reading. Tons of external references mentioned.
Quick introduction to web services and their designated protocols and specifications: http://www.w3schools.com/webservices/default.asp
Interface design and user interaction
Design Patterns is a good knowledge base on how to design user interfaces. For sure, all concepts have to be evaluated!
Short introduction on some topics of security in context of distributed systems: http://www.w3.org/standards/xml/security