Why Perforce is more scalable than Git
Okay, say you work at a company that uses Perforce (on Windows). So you're happily tapping away using Perforce for years and years. Perforce is pretty fast -- I mean, it has this "nocompress" option that you can tweak and turn on and off depending on where you are, and it generally lets you get your work done. If you change your client spec, it synchronizes only the files it needs to. Wow, that blows the mind! Perforce is great, why would you ever need anything else? And it's way better than CVS.
Suddenly you have to clone something with git, and BAM! The world is changed. You feel it in the water. You feel it in the earth. You smell it in the air. Once you've experienced git, there is no going back, man. Git is the stuff man. You might have checked out firefox -- but have you checked out firefox ooon GIT?
So many really obvious things are missing in p4. Want to restore your source tree to a pristine state? "git clean -fd". Want to store your changes temporarily to work on something else? "git stash". Share some code with a cube-mate without checking in? "git push". Want to automatically detect out of bounds array accesses and add missing semicolons to all your code? "git umm-nice-try"
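For instance, the quick stash-and-fix dance looks something like this (a minimal sketch; the branch name is made up):

    git stash                  # park your half-finished changes
    git checkout -b quickfix   # hop onto a throwaway branch and fix the bug
    # ...commit the fix, switch back...
    git checkout master
    git stash pop              # pick up right where you left off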
Branching on git is like opening a new tab in a browser. It's a piece of cake. You can branch for EVERY SINGLE BUGFIX. And you wrote the code, so you get to merge it back in, because you are the expert.
Branching on Perforce is kind of like performing open heart surgery. It should only be done by professionals: experts in the art who really know what they are doing. You have to create a "branch spec" file using a special syntax. If you screw up, the entire company will know and forever deride you as the idiot who deleted "//depot/main". The merging is done by gatekeepers. Hope they know what they're doing!
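For reference, the ritual goes roughly like this (a sketch only; the depot paths are made up):

    p4 branch feature-x          # opens a form where you type the mapping, e.g.:
    #     View: //depot/main/... //depot/feature-x/...
    p4 integrate -b feature-x    # populate the branch
    p4 submit
    # ...and merging back later is its own adventure:
    p4 integrate -b feature-x -r
    p4 resolve
    p4 submit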
Now, if you have been using git for a few days you might discover this tool called "git-p4". "AHA!" you might say, "I can import from my company's p4 server into git and work from that, and then submit the changes back when I am done." But you would be wrong, for a number of reasons.
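The intended workflow certainly looks appealing (sketched here with a made-up depot path; on older installs the command is spelled git-p4 rather than git p4):

    git p4 clone //depot/project@all project   # import the p4 history into a git repo
    cd project
    # ...hack away on local branches...
    git p4 rebase                              # pull in new p4 changes
    git p4 submit                              # push your commits back as changelists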
git-p4 can't handle large repositories

Really. It's just a big Python script, and it works by downloading the entire p4 repository into a Python object, then writing it into git. If your repo is more than a couple of gigs, you'll be out of memory faster than you can skim reddit.
But that problem's fixable. I was able to hack up git-p4 to do things a file at a time in about an hour. The real problem is:
Git can't handle large repositories

Okay this is subjective because it depends on your definition of large. When I say large, I mean about 6 gigs or so. Because your company's source tree is probably that large. If you have the power, you will use it. Maybe you check in binaries of all your build tools, or maybe for some reason you need to check in the object files of the nightly builds, or something silly like that. P4 can handle this because it runs on a cluster of servers somewhere in the bowels of your company's IT department, administered by an army of drones tending to its every need. It has been developed since 1995 to handle the strain. Google also uses Perforce, and when it started to show its strain, Larry Page personally went to Perforce's headquarters and threatened to direct large amounts of web traffic up their executives' whazzoos until they did something about it.
Git has none of that. The typical git user considers the Linux kernel to be a "large project". If you've watched Linus's git rant at Google, listen to how he sidesteps the question of scalability.
Don't believe me? Fine. Go ahead and wait a minute after every git command while it scans your entire repo. It's maddening because it's long enough to be annoying, but not long enough to skim Geekologie.
The solution

You know what? I don't think many people really use distributed source control. The centralized model is here to stay. Most git users (especially those using Github) use the centralized model anyway.
Ask yourself this: Is it really that important to duplicate the entire history on every single PC? Do you really need to peruse changelist 1 of KDE from an airplane? In most cases, NO. What you really want is the other stuff: easy branching, clean, stash, and the ability to transfer changes to another client. The distributed stuff isn't really asked for, or needed. It just makes it hard to learn.
Just give me a version control system that lets me do these things and I'll be happy:
* Let me "stash" stuff cause it's really handy. Clean is nice to have too.
* Make branching easy.
* Let me merge changes into my coworker's repos, without having to check them in first.
* Don't waste 40% of my disk space with a .git folder, when this could be stored on a central server.
Is that really so hard?
I happen to use Perforce at work, and Git for personal projects. I don't dislike Perforce, even though it has the drawbacks you describe. I like Git, now that it has good Windows support, particularly for its branching capability, which is sorely missing in Perforce.
E.g., when you're working on a feature and then have to do a quick fix... touching one of the files you've already changed for the feature.
Perforce also has good visual tools (P4V). The time-lapse view and the revision graph are particularly powerful.
I know Perforce is appreciated in the game industry, precisely for the reason you mention: large repository handling, and particularly large file handling. Game assets (images, maps, sounds, movies, etc.) can take up a lot of space!
But Git didn't want to be left behind. So, seven years after your article, Git has a special feature to handle large assets. I haven't tried it, and I don't know if it is on par with Perforce, but it is here.
On the other hand, Perforce evolved too: they let you shelve files (the equivalent of stash), which also lets you share uncommitted code with co-workers.
They also let you work offline, in case the server is down (or not reachable)...
Still no easy branching, reserved for the company's Perforce gurus...
And still using this pesky read-only attribute...
Here is a correction to this article and a list of updates to Perforce that change some of the things described here (can't blame it for being written a while back; the world changes).
Re: "So many really obvious things are missing in p4." …
::Want to restore your source tree to a pristine state? "git clean -fd".
--> As of Perforce 2014.1, the "p4 clean" command does this.
::Want to store your changes temporarily to work on something else? "git stash".
--> This has been possible with the "p4 shelve" command since P4 2009.2.
::Share some code with a cube-mate without checking in? "git push".
--> There are ways to do this (shelving is one; see the sketch below), but creating a branch for every person or code fix isn't a typical way of doing business in P4.
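For what it's worth, the shelve-based sharing flow looks something like this (a sketch; the changelist number is made up):

    p4 shelve -c 12345      # park your opened files on the server
    # your cube-mate then runs:
    p4 unshelve -s 12345    # pull the shelved files into their own workspace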
Re: Branching, git vs. P4
::Branching on Perforce is kind of like performing open heart surgery. It should only be done by professionals: experts in the art who really know what they are doing. You have to create a "branch spec" file using a special syntax.
--> This really has never been true. Branch specs are helpful but not required. If you understand branching strategy for your team/group/company, this isn't difficult at all. Merging, on the other hand, can be ugly if you do it wrong and submit the changes. That's true with any SC system.
::If you screw up, the entire company will know and forever deride you as the idiot who deleted "//depot/main".
--> You can't really delete a branch by branching. By merging, sure. This is what rollback is for.
en.wikipedia.org/wiki/Repo_(script)
basically, repo allows you to combine different git repositories together.
In the case of android, each hardware company (eg Qualcomm for their radio, Broadcom for their bluetooth/wifi) will have separate git repositories for each component.
Repo manages all the git repositories automatically (you can still control git yourself)
(yes, it still wouldn't solve problems of having many large binary blobs and calculating md5sums for them)
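For anyone curious, typical repo usage is just this (a sketch; substitute the manifest URL for your project):

    repo init -u <manifest-url>   # fetch the manifest listing all the git repositories
    repo sync                     # clone/update every repository it names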
(And for those of us outside the USA, being able to work offline is a must, but I can see how not everyone will care about that.)
Finally, you're talking about disk space. Mind telling me why I need to have double the disk space available with Perforce just to be able to switch quickly between any 2 branches? I have actually run out of disk space just because of this and have lost valuable productive time trying to free up enough space to check out another branch. Never again.
On the other hand, I'm using Perforce right now. Turns out that even a simple merge, check-in, or branch operation is slow. The client continuously polls the server, sometimes crashes if you leave it running long enough, and you must rely on the network and servers for every little thing you want to do. Yes, shelving relies on the server; the server even keeps track of what I have and what I don't, with the obvious desynchronization issues.
For each file, if the cost of a checksum is less than the cost of downloading the whole file, they should try to do an incremental transfer.
I really want the local branches and the freedom from having to check out files, and we've gone without a server update since 2002 (it was that or health insurance -- that bad), but it's just become too big a time sink for me to even investigate it anymore.
So I thought, "hey! I'll just take a fresh install and make it a git repo." This worked to some degree but some of the mods had large files. Eventually when I went to switch to a different branch it just died with an out of memory error.
This is because git has to store the whole file in memory to process it. The machine I was using has 6GB of RAM, but on Windows most builds of git are 32-bit.
Bam. Dead in the water. I had to actually boot up Ubuntu on a live disc, apt-get install the 64-bit version of git just to swap branches. Fail; plain and simple.
It sucks to have a designer create a great tool like git only to have him also be too lazy to solve some edge cases for others.
* File sizes > RAM? This should be doable in a slower way only when needed.
* File sizes > 32-bit version capabilities? Again fix it but have it use the slower algorithm only when needed.
* 32-bit only version... Seriously, most new computers other than netbooks have 64-bit capability these days. Just make 64-bit the default.
Being too stuck up to solve this problem, which would obviously increase adoption of your tool, just seems dumb. And those who say that with a >6GB repo you're doing something wrong -- or who don't have large repos, or don't version large files -- obviously haven't run across a business need to do so. When your paycheck requires it, you'll be singing a different tune.
I used Perforce when I worked at Google and will likely use it again in my next company for which I just got hired. I like it but I know I am going to miss features from a DVCS. I used Bazaar at my last company and it was quite nice but also suffers from the same problem as git and I believe hg.
Zero downtime. No administration needed. What else can one ask for?
>but the number of files that is painful
That's exactly my case. We tried to migrate a WebMethods repository containing lots of services (corporate scale, all currently used/deployed, and it cannot be split into submodules/subtrees). It contains something like 100k files, and doing a simple git status took about 10 minutes of disk IO while it was scanning for changes.
www.jaredoberhaus.com/tech_notes/2008/12/git-is-slow-too-many-lstat-operations.html
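Depending on your Git version, some of that lstat pain can be dialed down with config knobs (a sketch; core.fscache is specific to Git for Windows builds, and availability varies by version):

    git config core.preloadindex true   # stat the index in parallel
    git config core.fscache true        # cache filesystem stat results on Windows
    git status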
git is clearly designed for what I would call "small" projects like the Linux kernel. If you want to do another project, you do not add it to an existing git repository, you make another one. This best fits with pushing and pulling a single project. But if you have a large system that is composed of many such smaller projects, you have to use something other than the source control system to synchronize their dependencies.
Translated:
>"I don't need distributed source control, so I know nobody out there will need it, as I don't see why they should. But they WILL need to move 6gb repos, because I do, so that's what normal people needs."
In short: different people, different needs. I'm the happiest SCM user since I switched to git for my <6GB projects, which doesn't mean it has to fit everyone and every possible project, for the same reason I don't use vim to edit jpg files.
Or you could go my way and have your .git folder actually be a symlink to a folder on another machine over ssh.
Please. When you argue about this stuff, please research thoroughly. There are a lot of things you can do with git that just take a while to learn.
Git is like really good drugs.
It works well under DVCS tools such as Git and Mercurial (though a lack of branch naming is sometimes an issue, depending upon the tool) - it works absolutely blindingly under Clearcase - unfortunately for Clearcase it is expensive IBM software, and the hardware constraints on that tool (particularly for dynamic views) make it a compromise also.
Perforce, CVS, and Subversion are cut from the same cloth, however - they are light-years behind the branching capabilities of DVCSs, and also of Clearcase, which has had fantastic branching semantics available since the mid-90s.
BTW, rather than store the data in the repo, we've started storing the git hashes with the data. Works nicely.
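Presumably something along these lines (a sketch; the file names are made up):

    # compute the blob id git would assign, and keep it alongside the asset
    git hash-object big_asset.bin > big_asset.bin.githash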
Why would you ever expect git to work well in a centralized usage scenario? Would you expect p4 to work well in a distributed use case? Honestly, dude...Apples and Oranges.
And what happens when that central server is inaccessible? or when you're travelling to a trade show with a demo and you have a really cool idea on the plane you'd like to try out? P4 can be a real pain in the proverbial wazoo in those circumstances.
You'll want to get some p4api and set P4API_BASE to the directory where you untar it; this lets the plugin use the C++ bindings for perforce instead of running the command-line client.
Look at Documentation/vcs-git-p4.txt for how to configure it; you generally end up actually getting data simply with "git fetch origin" (or "git fetch" if you apply the bugfix I forgot to send back from work).
I'm in the games biz myself and we ran into these problems with svn. Once we got past a certain size team and asset base, it started to really choke. I wrote up a little postmortem at scottbilas.com about our experience with it (search for 'svn').
We tried really hard to make svn work because of the astronomical price of P4. A price that we all grudgingly pay again and again in this industry because everything else is so much worse.
My current plan is to clone the commands from git into our command line p4 extension tool we have (it does things like auto-creating Crucible code reviews and such). For example, 'stash' should be pretty easy to implement. Actually, it already exists. Search the p4 public depot for 'p4tar'. I haven't tried it out yet.
Anyway, the other commands should be implementable with a tool on top of p4 using p4api.net. If only I had some spare time... :)
Of course with the Perl Perforce repository, the size was something like 450MB in Perforce and 70MB in Git, once the crazy metadata format used by Perforce's insane integration system was appropriately grokked.
I mean, don't get me wrong, I think Perforce is a great product - beats SVN hands-down in design and was around many years before - it's just too complex. Integration is badly modelled, hardly anyone understands it properly. So in that respect, Perforce doesn't scale to very large teams because the branching model is too hard to work with.
Yes of course Git doesn't do a lot of that product release cycle development / Software Configuration Management. It's unix: it does one thing and does it well.
IMHO, a version/revision control tool, with all its diff, 3-way-merging, and compressed delta storage goodies, is at its best when it's storing editable source. Storing binary data, especially binary data that can be recreated from the version-controlled source, is not the ideal use for this kind of system. That said, I've done it too, because I also believe that every version of the source should include the tools used to process the source into the product shipped to the customer. But I would like to consider the use of a different paradigm for the archiving of binary data, especially mongo BLOBs. I would like to consider a system more ideally suited to storing Big Honkin binary files, and have a reference to those BLOBs in the version control system. Now I wonder what would work.....
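A bare-bones version of that paradigm might look like this (purely hypothetical paths; it is roughly what tools like git-annex, and later Git LFS, formalize):

    # stash the BLOB in a content-addressed store, commit only a tiny pointer
    sha=$(sha1sum big_movie.avi | cut -d' ' -f1)
    cp big_movie.avi /mnt/asset-store/$sha
    echo $sha > big_movie.avi.ptr
    git add big_movie.avi.ptr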
Why? What's the big deal with checking in? Use a personal branch, and have your bunker-mate use one too. Check in your WIP on a regular basis, just in case your drive goes kablooie.
{quote}Let me "stash" stuff cause it's really handy. Clean is nice to have too.{/quote}
I must be missing something. Wouldn't a personal branch work just fine for this?
{quote}Make branching easy. {/quote}
Branching in Perforce is difficult for users who don't understand the nuances of client workspace mapping. When you understand how the repository is structured, and how your local hard drive is laid out, it becomes so much easier. If you don't know the structure of the repository, which contains the family jewels, please turn in your coder's badge. If you don't know how your own disc is structured, please turn in your computer.
{quote}Don't waste 40% of my disk space with a .git folder, when this could be stored on a central server. {/quote}
Good idea. I'm curious -- let's say we had a multi-TB repository, with 80k files on just one tip, tens of thousands of branches, 1600 coders, 11 locations, 8 time zones. If we were using GIT, and I wanted to work disconnected from the network for a couple of days, what would be "gotten" onto my laptop?
If you have that large a repo, it's probably because you're stuffing large binary blobs into git. If you're stuffing large binary blobs into git, you need to look into the .gitattributes file so that git won't try to diff/compress said large binary files. It's got some heuristics to try and recognize them, but making its work a bit easier is sure to show you some gain.
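A sketch of what that .gitattributes might contain (the patterns are examples only):

    # treat these as opaque binaries: no text conversion, no diff, no delta compression
    *.png   binary -delta
    *.psd   binary -delta
    *.zip   binary -delta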
However, for 99% of the software developers out there, git (or one of its DVCS brethren) just works. In those cases, the benefits of being entirely mobile, having near-zero time cost for most actions, and the ability to easily experiment with the contents of the repository are game-changing wins. For the top 1%, there are tools like Clearcase and Perforce.
Thanks,
John
So yeah, it's scalable, but it's directly proportional to the size of the server it's on.
In the meantime, it's still just easier to dump stuff into P4. Beefy 64-bit P4 servers are cheap to build now.
it seems as if there is a specific problem with Git, namely it doesn't handle large binary files well (large images, artwork, etc).
Has anyone actually taken this specific use-case to the Git developers on the mailing list?
Second, it seems like your problem could be solved by having a separate machine to run Git just for your Binary assets. When you need to make a build, you just dump all those files to the machine, have it version the directory, and then include that 'version' into your Git source repo.
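In other words, something like this (a sketch with made-up paths and ids; the asset repo lives on its own box):

    # on the asset machine
    cd /srv/assets
    git add -A && git commit -m "assets for build 1234"
    git rev-parse HEAD            # prints the snapshot id, e.g. 3f2a...

    # in the source repo, record which asset snapshot this build used
    echo 3f2a... > ASSET_VERSION
    git add ASSET_VERSION && git commit -m "pin asset snapshot for build 1234"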
Interesting post.
Any web company with non-source code in their repo will run into the same thing. I'm surprised more people haven't pointed out this glaring problem with the git model.
The number of simultaneous clients that can be doing operations is just as important, if not more so. P4 was notorious for holding locks far longer than necessary, and clients would queue up for minutes at a time (I remember syncs that would take more than half an hour on a fairly small repository because there were a hundred other clients trying to sync).
P4 does *not*, in fact, scale well (although I admit that more recent versions of P4 are better than what I was using in 2004).
I feel pretty confident your assessment of git would be different if you had 1,000 coworkers using your P4 repository at the same time.
As far as flexibility goes, Git is awesome, so stop posting things that don't make sense.
Then you've never been responsible for builds in the games biz...
A lot of teams try out Alien Brain, and quickly realize that
1. It's structured like Visual Source Safe or CVS (in other words you aren't REALLY versioning changes, just files, and that's really bad), and
2. Versioning artwork against code is just as important as versioning one code change against another or one artwork change against another, and having your artwork and your code in different version control systems, even when they're both structured around atomic changes (which Alien Brain isn't) causes problems.
So most teams just dump artwork, intermediate data files, and all sorts of things in the same p4 depot that their code is in. And it works like a champ. Except that p4 is missing so many of the cool features that git gives you.
But not much later:
"Let me merge changes into my coworker's repos, without having to check them in first."
That would be distributed stuff.
(All SCM GUIs suck, imo, but that's my CLI-bias bleeding through)
Well you may have identified another use case where Git is not ideal - really large binary blobs. I think the problem is Git has to checksum (sorry SHA1) all files it scans - and that would take some time on a 36GB file.
To be fair, Git has always been advertised as a SCM - i.e. a source-code management system - and for that use-case it absolutely rocks IMO. Personally I would still investigate a hybrid approach where you have the option of pulling just the source down to your lappy with Git, so if you are on the plane and you DO want to look at change-set 1 at least you can!
Bypassing the central respository to share patches... meh. This I do not see as a feature -- if there's a central repository, it should be used as the mechanism of communication between developers.
On the other hand, "stashing" stuff is really nice. And branching (and merging) *should* be easy. I'm all over those two requests.
As for wasting my disk space... meh. Sometimes I care, sometimes I don't (disk is cheap, but disk fills up faster still). Having an option for git to use either a local or a remote (central/blessed) repository would be nice.
Disclaimer: I still use CVS, I've used Perforce (and liked it), and I use git (and like it), and I don't currently have an repositories that approach the sizes discussed in the article.
My one experience of Perforce was doing work with another company remotely. Our VPN was unfortunately a bit dodgy. Combining that with Perforce led to an incredibly frustrating experience.
I wouldn't recommend using it if you're not on a LAN.
Agree with your points about scalability. Git is not good for anything other than source code (medium # of small text files).
Would that be your coworkers distributed repository, by any chance?
To merge changes into a coworker's repo, why can't they just patch a CL? You don't have to submit a CL for a coworker to grab the changes.
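For instance (a sketch; the exact patch -p level depends on how your client maps depot paths to local ones):

    p4 diff -du > fix.patch    # unified diff of your opened files against the depot
    # coworker, after p4 edit-ing the same files:
    patch -p0 < fix.patch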
Really, for us, p4 works great. It stays out of our way, it's faster than anything out there. It's not distributed, but we don't care about that.
Git, plain and simple, does not scale to large repositories. That's OK, I guess, it's not really designed to handle that use case.
The solution? Track each project as a single Git repository, and if you need to tie them together, create a master repository that includes each one as a sub-module. The flexibility you gain from 'setting free' your individual projects is enormous, as is the smart use of a master repository that uses branches to create different mash-ups of your overall code-base.
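Roughly (a sketch; the repository URLs are placeholders):

    git init mashup && cd mashup
    git submodule add git://example.com/engine.git engine
    git submodule add git://example.com/art-tools.git tools
    git commit -m "master repo pinning each project at a known-good revision"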