Scalability

Not for 'how-to' coding questions but PHP theory instead, this forum is here for those of us who wish to learn about design aspects of programming with PHP.

Moderator: General Moderators

Christopher
Site Administrator
Posts: 13596
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Re: Scalability

Post by Christopher »

kaisellgren wrote:Is there anything that a PHP application must take care of in order to have a website operating under a clustered system? I bet not, and if I'm right, then there are better places to find information about server related stuff..
I think the point of a cluster is that it appears as a single server. Not sure what you mean by "there are better places to find information about server related stuff" ?
(#10850)
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: Scalability

Post by Eran »

Ultimately I am a web developer so I am always in search of solutions that are easy to deploy and maintain
What we've discussed mostly relates to server setup and the like, and a web developer shouldn't think too much about that. When it comes to the point you need to scale beyond one dedicated machine, you probably should get expert help.

From our point of view as web developers the biggest thing to consider is how to hide likely points of contention (such as the database) behind abstractions that will allow us to integrate scaling efforts without affecting our source code too much. Some scaling options need the application to be aware of them (such as sharding), but those are definitely not easy to set up and maintain, so it's probably not what you are looking for.

Didn't really understand what you meant about levels #3 and #4.
kaisellgren
DevNet Resident
Posts: 1675
Joined: Sat Jan 07, 2006 5:52 am
Location: Lahti, Finland.

Re: Scalability

Post by kaisellgren »

arborint wrote:Not sure what you mean by "there are better places to find information about server related stuff" ?
I'm just saying if you are interested in setting up clustered servers or other scaling solutions, Server Fault, Web Hosting Talk and the like are probably the best places to look for information.
Christopher
Site Administrator
Posts: 13596
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Re: Scalability

Post by Christopher »

pytrin wrote:
Ultimately I am a web developer so I am always in search of solutions that are easy to deploy and maintain
What we've discussed mostly relates to server setup and the like, and a web developer shouldn't think too much about that. When it comes to the point you need to scale beyond one dedicated machine, you probably should get expert help.
Right, and currently getting single servers is simple enough to not need "expert help." And I have set up MySQL replication, etc., so I know that more complex architectures are getting easier. For example, I know MySQL 5.1 adds failover that gives replication some of the advantages of a cluster in a simpler configuration. So part of my question is whether there are next-step-up architectures that have become easy enough to install/maintain as to not need "expert help." We used to need expert help to get LAMP systems up and running...
pytrin wrote:From our point of view as web developers the biggest thing to consider is how to hide likely points of contention (such as the database) behind abstractions that will allow us to integrate scaling efforts without affecting our source code too much. Some scaling options need the application to be aware of them (such as sharding), but those are definitely not easy to set up and maintain, so it's probably not what you are looking for.
Well, I think that is part of it. I was thinking of adding some features to the database connectors, like those that WordPress's HyperDB and other projects implement (rather than having to have multiple connectors). There seem to be two different features: first, read/write separation with two separate connections, and second, selecting among a list of connections for reads based on some criteria (priority, random, round robin, etc.).
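A minimal sketch of those two features, written in Python only for brevity and not taken from HyperDB or any real connector. The class name `ConnectionSelector`, the `(name, priority)` replica pairs, and the "starts with SELECT" test are all assumptions made for illustration.

```python
import random
from itertools import cycle


class ConnectionSelector:
    """Pick a connection name for a query: writes go to the master, reads to a replica."""

    def __init__(self, master, replicas, strategy="round_robin", rng=None):
        self.master = master
        # Each replica is a (name, priority) pair; a lower number means preferred.
        self.replicas = list(replicas)
        self.strategy = strategy
        self.rng = rng or random.Random()
        self._ring = cycle([name for name, _ in self.replicas])

    def for_write(self):
        return self.master

    def for_read(self):
        if not self.replicas:
            # With no replicas configured, reads fall back to the master.
            return self.master
        if self.strategy == "priority":
            return min(self.replicas, key=lambda r: r[1])[0]
        if self.strategy == "random":
            return self.rng.choice(self.replicas)[0]
        return next(self._ring)  # round robin

    def for_query(self, sql):
        # Crude classification: only statements starting with SELECT count as reads.
        is_read = sql.lstrip().upper().startswith("SELECT")
        return self.for_read() if is_read else self.for_write()
```

With asynchronous replication a replica can briefly lag behind the master, so a real connector would probably also need a way to force a read onto the master right after a write.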
pytrin wrote:Didn't really understand what you meant about levels #3 and #4.
It is from my comment above:
arborint wrote:... The simplest and most common small web application architectures are:

1. Single server solution - static content, dynamic content, and database server all on the same server

2. Multiple single server solution - static content, dynamic content, and database server each on its own server. It could be two or three servers. This does not require any real architectural changes, just changing the DB connection host info and the URL to static content. This is really a performance improvement only. If any of the servers die, the site is essentially down.

So my question is, what are the next steps up from that? And how do different arrangements compare in performance, ease of setup/administration, stability, availability?
(#10850)
Christopher
Site Administrator
Posts: 13596
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Re: Scalability

Post by Christopher »

kaisellgren wrote:I'm just saying if you are interested in setting up clustered servers or other scaling solutions, Server Fault, Web Hosting Talk and the like are probably the best places to look for information.
Agreed, but part of my question is whether there are simple enough solutions now available that web developers with sysadmin experience can install/maintain them without much more difficulty than single-server systems. And DBAs will never give you that answer. Only a DB-savvy web developer would have figured that out. ;)
(#10850)
kaisellgren
DevNet Resident
Posts: 1675
Joined: Sat Jan 07, 2006 5:52 am
Location: Lahti, Finland.

Re: Scalability

Post by kaisellgren »

arborint wrote:part of my question is whether there are simple enough solutions now available that web developers with sysadmin experience can install/maintain them without much more difficulty than single-server systems?
Well, I have Master-Slave replication (MySQL) set up on my home network and running fine, although I don't use it for anything except testing purposes. It's just a matter of a few configuration changes and installing MySQL on 2+ computers. M-S lets you split the read load across n servers, but all writes still go to the single master server. If you want something above that, it's gonna be messy..

Edit: If you want to set up MySQL Cluster, then it will not be straightforward, only in your wettest dreams ;)

If you have the time: http://dev.mysql.com/doc/refman/5.1/en/ ... uster.html :P

If you just want to set up a clustered system, that will be a lot easier than maintaining, securing, monitoring, etc. the system, which is what you need to do with production sites.
Christopher
Site Administrator
Posts: 13596
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Re: Scalability

Post by Christopher »

This is starting to get at my question. Clustering seems complex to set up. I may be wrong about that, so someone who has set up a cluster please correct me. Also a cluster seems to need 4-5 computers, but maybe I am incorrect there as well. But if clusters are too expensive/difficult then replication starts to look better on the low end. And again we are talking about both performance scaling and getting the website back online quickly after a failure.
(#10850)
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: Scalability

Post by Eran »

Again, clustering and replication do different things. Clustering is more for high availability than performance (for large datasets, performance suffers). Replication has lag problems in MySQL since it's asynchronous (Google did some work to fix that - http://code.google.com/p/google-mysql-t ... ql5Patches). It's all about balancing the pros and cons.

There's also sharding, which is a very effective technique for scaling. Though it's probably the most complicated application wise (the application has to be aware of the configuration of the shards in some form).
Christopher
Site Administrator
Posts: 13596
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Re: Scalability

Post by Christopher »

pytrin wrote:Again, clustering and replication do different things. Clustering is more for high availability than performance (for large datasets, performance suffers). Replication has lag problems in MySQL since it's asynchronous (Google did some work to fix that - http://code.google.com/p/google-mysql-t ... ql5Patches). It's all about balancing the pros and cons.
Yes, but for example MySQL 5.1 has replication with failover which can achieve some high availability goals. And I think a lot of the confusion comes from the fact that they do different things yet overlap, and that there are variations within each for a range of different solutions. But for the specific target that I am discussing, which is the next step up for a regular web application, the number of possible solutions is much smaller. And that type of system would probably be of interest to many people here, as opposed to less common scenarios.
pytrin wrote:There's also sharding, which is a very effective technique for scaling. Though it's probably the most complicated application wise (the application has to be aware of the configuration of the shards in some form).
Yes, as I understand it sharding is for very large datasets. I have dabbled in this with MySQL by setting the auto-increment step to greater than 1 so writes can go to one of several servers. That is a simple solution that worked pretty well.
(#10850)
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: Scalability

Post by Eran »

Yes, but for example MySQL 5.1 has replication with failover which can achieve some high availability goals.
You referred to this before, and I'm wondering about that. As far as I know there is no official support for replication failover in MySQL. Quoting from the MySQL manual:
There is currently no official solution for providing failover between master and slaves in the event of a failure. With the currently available features, you would have to set up a master and a slave (or several slaves), and to write a script that monitors the master to check whether it is up. Then instruct your applications and the slaves to change master in case of failure.
So basically you would need to write your own failover mechanism.
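As a rough illustration of what such a hand-written mechanism has to decide, here is a sketch in Python. It only covers the "is the master up, and who replaces it" step. The function name `choose_master`, the `is_alive` callable, and the retry count are assumptions for the example, not anything MySQL provides.

```python
def choose_master(current, replicas, is_alive, max_failures=3):
    """Return the server that should act as master after probing `current`.

    `is_alive` is a caller-supplied callable (hypothetical here) that returns
    True when a server answers a health check.
    """
    for _ in range(max_failures):
        if is_alive(current):
            return current  # master answered, nothing to do
    # The master failed every probe: promote the first replica that responds.
    for candidate in replicas:
        if is_alive(candidate):
            return candidate
    raise RuntimeError("no live server available to promote")
```

Choosing the new master is the easy part. As the manual excerpt says, the script would then still have to instruct the applications and the remaining slaves to switch to it.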
Yes, as I understand it sharding is for very large datasets. I have dabbled in this with MySQL by setting the auto-increment step to greater than 1 so writes can go to one of several servers. That is a simple solution that worked pretty well.
Simple for writing, but how would you know how to fetch the information if you are not getting it by the primary key?
Usually sharding is done by splitting the database across separate servers. There's also partitioning, which allows splitting the tables themselves. Both require that the application knows how to retrieve the data once it's no longer on the same server.
Christopher
Site Administrator
Posts: 13596
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Re: Scalability

Post by Christopher »

pytrin wrote:You referred to this before, and I'm wondering about that. As far as I know there is no official support for replication failover in MySQL. Quoting from the MySQL manual:

So basically you would need to write your own failover mechanism.
I have read there is now something in 5.1 that allows this. There is also a thing called MySQL Master-Master Replication Manager and active/passive multi-master replication. I think it does queries like a heartbeat and will promote a passive server to active on fail. I am not sure if that is it.
pytrin wrote:Simple for writing, but how would you know how to fetch the information if you are not getting it by the primary key?
Usually sharding is done by splitting the database across separate servers. There's also partitioning, which allows splitting the tables themselves. Both require that the application knows how to retrieve the data once it's no longer on the same server.
You do get it by primary key. It's just that primary keys 1,5,9.. are on one server and 2,6,10.. are on a second and 3,7,11.. are on another. You set the step to be however many servers there are. You can either have the application code determine the server, or with replication you can read back from any server. There are many ways to do sharding. This was just something I tried that was really easy and worked well.
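To make the key-to-server mapping concrete, here is a small Python sketch. It assumes a step of 4, which matches the sequences above, and that each server starts at a different offset (in MySQL these correspond to the `auto_increment_increment` and `auto_increment_offset` settings). The function names are made up for the example.

```python
def ids_generated(offset, step, count):
    """IDs a server hands out when it starts at `offset` and increments by `step`."""
    return [offset + i * step for i in range(count)]


def server_for_id(row_id, step):
    """Return the 1-based offset of the server that generated `row_id`."""
    return (row_id - 1) % step + 1
```

For example, `ids_generated(1, 4, 3)` gives `[1, 5, 9]`, and `server_for_id(10, 4)` gives `2`. The mapping depends on the step, so it only holds while the number of servers stays fixed.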
(#10850)
Eran
DevNet Master
Posts: 3549
Joined: Fri Jan 18, 2008 12:36 am
Location: Israel, ME

Re: Scalability

Post by Eran »

You do get it by primary key
That was my point - any other way to fetch rows would be difficult. That's the main difficulty with sharding: coming up with a scheme that still allows complicated queries to be completed.
Christopher
Site Administrator
Posts: 13596
Joined: Wed Aug 25, 2004 7:54 pm
Location: New York, NY, US

Re: Scalability

Post by Christopher »

pytrin wrote:That was my point - any other way to fetch rows would be difficult. That's the main difficulty with sharding: coming up with a scheme that still allows complicated queries to be completed.
Actually I was able to search on other fields by generating a random order in which to search the multiple databases. It worked fine. But I don't really think huge data sets are much of a problem for your average developer like me. Likewise, scaling to 100 or 1000 servers is not a pressing need. That's why I was asking about the low end just above single-server solutions to see if there were any good solutions to improve availability.
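One way to read "a random order to search the multiple databases" is sketched below in Python. This is an interpretation, not the original code: plain lists of dicts stand in for database connections, and `find_on_shards` is a made-up name.

```python
import random


def find_on_shards(shards, predicate, rng=None):
    """Search every shard in random order; return (shard_name, row) or None.

    `shards` maps a shard name to its rows; plain lists of dicts stand in for
    real database connections here.
    """
    rng = rng or random.Random()
    names = list(shards)
    rng.shuffle(names)  # random order spreads the first-hit load across servers
    for name in names:
        for row in shards[name]:
            if predicate(row):
                return name, row
    return None
```

The trade-off is that a miss, or an unlucky ordering, touches every server, which is acceptable with a handful of shards but not with hundreds.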
(#10850)
Weirdan
Moderator
Posts: 5978
Joined: Mon Nov 03, 2003 6:13 pm
Location: Odessa, Ukraine

Re: Scalability

Post by Weirdan »

arborint wrote:or with replication you can read back from any server.
And it's not sharding, just an ordinary M-M setup, because there are no disconnected shards anymore once you start to replicate (in both directions, I suppose).
William
Forum Contributor
Posts: 332
Joined: Sat Oct 25, 2003 4:03 am
Location: New York City

Re: Scalability

Post by William »

arborint wrote:
pytrin wrote:Simple for writing, but how would you know how to fetch the information if you are not getting it by the primary key?
Usually sharding is done by splitting the database across separate servers. There's also partitioning, which allows splitting the tables themselves. Both require that the application knows how to retrieve the data once it's no longer on the same server.
You do get it by primary key. It's just that primary keys 1,5,9.. are on one server and 2,6,10.. are on a second and 3,7,11.. are on another. You set the step to be however many servers there are. You can either have the application code determine the server, or with replication you can read back from any server. There are many ways to do sharding. This was just something I tried that was really easy and worked well.
The problem with that is it's limited to a set number of servers. My two cents on this whole topic is that scaling small-to-medium sites is not that difficult.

(These are examples)
Step #1
Master/Master - ( Just two servers, keep them at around 50% load each that way either master can take over if something goes wrong )

Step #2 (you grow)
Master/Master with slaves. You read from slaves, you write to masters, simple.

Step #3
As your reads keep growing you can start doing things like having slaves that also act as masters, so that other slaves replicate off them. So let's say the master replicates to 10 slaves, then those 10 slaves each replicate to 10 slaves of their own; you can scale reads almost forever.

Step #4
Okay, so you're starting to have write issues. There are tons of ways to solve this problem, as specified in posts before. Horizontal Sharding ( e.g. Splitting table data across multiple servers. For instance you might have 1,000,000 users on server A while the other 500,000 are on server B. ) Vertical Sharding ( e.g. Splitting up table(s) across multiple databases. You can do something like putting the entire users table on a separate database server, or before you have that big of a users table you can put multiple tables on the same database server. Basically instead of having one huge database server, you cut the tables into groups.)
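The two kinds of sharding in Step #4 can be sketched as two lookup functions. This is Python, for illustration only. The server names, the ID boundaries (taken from the 1,000,000 / 500,000 example), and the table grouping are assumptions.

```python
import bisect

# Horizontal sharding: rows of one table are split by user ID range.
# Each entry is (highest user ID on that server, server name); values are made up.
USER_RANGES = [(1_000_000, "server_a"), (1_500_000, "server_b")]

# Vertical sharding: whole tables are grouped onto different servers.
TABLE_GROUPS = {"users": "db_users", "comments": "db_content", "photos": "db_content"}


def server_for_user(user_id):
    """Find the server whose ID range contains `user_id`."""
    upper_bounds = [upper for upper, _ in USER_RANGES]
    index = bisect.bisect_left(upper_bounds, user_id)
    if user_id < 1 or index == len(USER_RANGES):
        raise KeyError(f"no shard holds user {user_id}")
    return USER_RANGES[index][1]


def server_for_table(table):
    """Find the server that holds `table`."""
    return TABLE_GROUPS[table]
```

Either way the application, or a layer beneath it, has to consult a map like this before it can open the right connection.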

Now the problem with this comes with relationships. You have to start denormalizing your data. JOINs are not your friend as you scale. For instance, Facebook reportedly doesn't have a single JOIN in production. To solve issues like this, sites like Flickr, Facebook, Digg, etc. start replicating their data across their servers. For instance User ID 51789 might be on Server #482 and they'll not only store that user's information, but any comments, picture entries, etc. on that server also. If two users need to have the same data joined, they'll replicate the data. For instance, if an entry in the database links to two different users, then each of the users' servers will have that data. Sure, denormalization makes things much harder, as your application logic needs to account for the fact that you duplicate data and need to update both copies, etc. But no one said this would be easy.

Also, to fix the issue of not being able to locate rows by primary key alone, a lot of big sites have index servers. These index servers contain information on where all the data is. For instance, I'll connect to my users index and ask it where user 45 is. It will return the cluster / server that user is located on, and then I'll connect to that.
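A minimal Python sketch of that lookup, assuming an in-memory dictionary in place of a real index server and plain dictionaries in place of database servers. `ShardIndex` and `fetch_user` are hypothetical names.

```python
class ShardIndex:
    """Stand-in for an index server: maps a user ID to the server holding that user."""

    def __init__(self):
        self._location = {}

    def assign(self, user_id, server):
        self._location[user_id] = server

    def locate(self, user_id):
        # Returns None when the index has no entry for this user.
        return self._location.get(user_id)


def fetch_user(index, servers, user_id):
    """Ask the index where the user lives, then read from that server."""
    server = index.locate(user_id)
    if server is None:
        return None
    return servers[server].get(user_id)
```

Because the location is stored rather than computed from the ID, a user can be moved to another server by updating one index entry. The price is an extra lookup on every request.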

Scaling isn't just about the database; you can reduce A LOT of load on your database servers by caching, for instance with Memcached. Facebook reportedly has a 99% cache hit ratio. That means that 99% of the time they can pull data off their cache instead of having to get data from their database. That helps a lot. Even simple things like fragment caching can be amazing: on your main page, for instance, you can set data to expire after 4 hours. Or you can make it not expire at all and have whatever updates that info in the backend also update the cache. It's really simple.
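Both variants (expire after a fixed time, or never expire and overwrite on update) fit in a few lines. The Python sketch below uses an in-process dictionary as a stand-in for a Memcached client. `FragmentCache` and `get_fragment` are hypothetical names, and the injectable `clock` exists only to make the example easy to test.

```python
import time


class FragmentCache:
    """Tiny in-process cache with optional expiry, standing in for a Memcached client."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._items = {}

    def set(self, key, value, ttl=None):
        # ttl=None means the entry never expires on its own.
        expires_at = None if ttl is None else self._clock() + ttl
        self._items[key] = (value, expires_at)

    def get(self, key):
        value, expires_at = self._items.get(key, (None, None))
        if expires_at is not None and self._clock() >= expires_at:
            del self._items[key]
            return None
        return value


def get_fragment(cache, key, build, ttl=None):
    """Return the cached fragment, rebuilding and storing it on a miss."""
    fragment = cache.get(key)
    if fragment is None:
        fragment = build()
        cache.set(key, fragment, ttl)
    return fragment
```

For the four-hour main page you would call `get_fragment(cache, "home", render_home, ttl=4 * 3600)`, where `render_home` is whatever builds the fragment. For the never-expire variant, the backend code that changes the data calls `cache.set("home", new_html)` itself.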

For scaling static content, you can use a 3rd party service like Amazon S3, etc. Or if you want to host it on your own, it's not that difficult; just make sure you use something other than Apache, since as said above it's not really the best for serving static content. You can use other software to help as well. For instance, if you have a large data set where people need to filter results, you can use something like SphinxSearch. It's not just made for search; you can use it to find data by adding filters on its huge index without even specifying a search query. Let's say I want to find all the users aged 18-24 who are male and in the zipcode xxxxx; it will return the results instantly.

Anyways, I wrote way too much. Hopefully this post helps someone. Like someone stated above, highscalability.com is a good resource. Cal Henderson wrote a great book called "Building Scalable Web Sites". He basically built Flickr. Although it's mainly PHP/MySQL based, the solutions can be used with other languages / databases with a bit of common sense.