Creating an e-commerce platform with Couchbase 2.0
For the last couple of months, I’ve been involved in the development of an e-commerce website named http://presive.com, focused on beautiful, unique products created by designers all over the world with a story behind.
When I first met @swilera, @eoingalla and @cmgottschalk, the idea seemed so interesting to me that I got involved in the project really fast. And then, in a few weeks, we had the first release of presive.com. We created it using Python, Django, Satchmo and running on the top of a PostgreSQL database engine and it worked well for some weeks, until we started developing the first business-level report queries. The whole object-relational stack the product was based on wasn’t enough at all for our purposes, so a technological change was needed.
And then, we moved to Couchbase. At first, I was quite skeptical on the power of NoSQL databases, but I soon changed my mind when I visit couchbase and read everything about it .
So let’s start on the conceptual model (yes, I am a lecturer at the UPC School of Informatics & Computer Science, so I’m meant to be a bit of an academic).
Conceptual Model
The CM of Presive.com is roughly sketched below
So the first thing to do was to create a (say) database schema from it. After some reading and a lot of guessing, I figured out that the best way of thinking about the NoSQL document model is the “cluster-join” approach. This is: forget about inner or left-outer joins. Store all the data you can in an already “joined” form.
So, for instance, a user became something like this:
As you can see, addresses are stored in an array inside the user object itself, so you can get rid of one expensive join operation.
On the other hand, and following the same rationale, we created product documents that live on their own (jsonType = product) and also placed the same document information inside the Store documents (jsonType = store). By doing this, we created replication of information in our database, totally broke the traditional relational model and forced the system to cascade-update Stores when updating a Product information. It also increases the complexity of the code (update, edit) but, in return, it gives the system an amazing speed.
Saving data
How data is saved into Couchbase can slightly change depending of your chosen development language. But it essentially consists in creating a Javascript object, stringifying it and sending it to couchbase. Remember that Couchbase is a key - value storage system, so you’ll need to provide an ID to the document being sent.
Editing is, by the way, somehow more complicated. Depending on how you retrieved the document (ie: the format of your object) and the library you are using, you should remove some of the data in the document before saving it. This is because couchbase (even in the admin console) returns an error when trying to save a document having the same revision (“_rev” attribute). So I recommend you to delete the keys named _id, _rev, $flags and $expiration from your object before replacing an existing doc. Remember this in the future to avoid time-wasting mistakes.
Retrieving data
Back on PostgreSQL, a typical query would look like this:
Now, on Couchbase, the process is completely different. No more querying like that. It’s all about mapping and reducing.
When I first met @swilera, @eoingalla and @cmgottschalk, the idea seemed so interesting to me that I got involved in the project really fast. And then, in a few weeks, we had the first release of presive.com. We created it using Python, Django, Satchmo and running on the top of a PostgreSQL database engine and it worked well for some weeks, until we started developing the first business-level report queries. The whole object-relational stack the product was based on wasn’t enough at all for our purposes, so a technological change was needed.
And then, we moved to Couchbase. At first, I was quite skeptical on the power of NoSQL databases, but I soon changed my mind when I visit couchbase and read everything about it .
So let’s start on the conceptual model (yes, I am a lecturer at the UPC School of Informatics & Computer Science, so I’m meant to be a bit of an academic).
Conceptual Model
The CM of Presive.com is roughly sketched below
So the first thing to do was to create a (say) database schema from it. After some reading and a lot of guessing, I figured out that the best way of thinking about the NoSQL document model is the “cluster-join” approach. This is: forget about inner or left-outer joins. Store all the data you can in an already “joined” form.
So, for instance, a user became something like this:
{
"_id": "44fb07ca-69ee-4fc1-bb44-e0b13f3d08d8",
"_rev": "56-000a6a6984fca8a10000065300000000",
"$flags": 0,
"$expiration": 0,
"email": "p.casado.arias@gmail.com",
"password": "password_in_encrypted_in_some_magical_way",
"tos": "yes",
"name": {
"givenName": "Pablo",
"familyName": "Casado"
},
"jsonType": "user",
"last_login": "2012-07-20T21:31:34+00:00",
"lang": "en",
"time_zone": "Madrid",
"displayName": "Pablo Casado",
"billingAddresses": [
{
"addressee": "Pablo Casado",
"street": "Some Place",
"postalCode": "08840",
"city": "Viladecans",
"stateProvince": "Barcelona",
"country": "Spain",
"phoneNumber": "34690916113"
}
],
"created_at": "2012-03-02T20:03:57+00:00"
}
As you can see, addresses are stored in an array inside the user object itself, so you can get rid of one expensive join operation.
On the other hand, and following the same rationale, we created product documents that live on their own (jsonType = product) and also placed the same document information inside the Store documents (jsonType = store). By doing this, we created replication of information in our database, totally broke the traditional relational model and forced the system to cascade-update Stores when updating a Product information. It also increases the complexity of the code (update, edit) but, in return, it gives the system an amazing speed.
Saving data
How data is saved into Couchbase can slightly change depending of your chosen development language. But it essentially consists in creating a Javascript object, stringifying it and sending it to couchbase. Remember that Couchbase is a key - value storage system, so you’ll need to provide an ID to the document being sent.
Editing is, by the way, somehow more complicated. Depending on how you retrieved the document (ie: the format of your object) and the library you are using, you should remove some of the data in the document before saving it. This is because couchbase (even in the admin console) returns an error when trying to save a document having the same revision (“_rev” attribute). So I recommend you to delete the keys named _id, _rev, $flags and $expiration from your object before replacing an existing doc. Remember this in the future to avoid time-wasting mistakes.
Retrieving data
Back on PostgreSQL, a typical query would look like this:
SELECT givenName, familyName from users WHERE email = “p.casado.arias@gmail.com";
Now, on Couchbase, the process is completely different. No more querying like that. It’s all about mapping and reducing.
On the Couchbase admin website above, you need to provide two different functions: the map (left, big) and the reduce function (left, small). In order to “query” the database, we’ll focus on the map function. So, to emulate the previous SQL query, your map function should look like this:
Take a look back at the user document example. Specifying the table you want to retrieve data from is equivalent to the if statement above. The result of the query will only contain information for documents having that jsonType. Then, choosing the “columns” you want to retrieve is equivalent to specifying the “value” in the emit function (in this case, an array).
If you create a view like that on your couchbase administration website under “_design/dev_test/_view/test”, you can get the results from it by navigating to:
Finally, the information you’ll get from couchbase will be like this:
Couchbase will always return the document ID and then, the key and the value you emitted. By passing the “key” argument in the previous URL, you are indicating to the database engine to return only information from the document having the matching key. In other words: you’ll only get results if the the first argument in the “emit” call matches the value of the key argument in the URL.
Stock control
The underlying business model of Presive.com required a 15-minute reservation cart for each product a user added to the cart. This means that, if you have added a product to your shopping cart, you have 15 minutes to go through the checkout process (or add another item to the cart) before your reservation expires.
This made the stock control process more complex than usual. It involved not only decreasing the actual stock for each SKU after a purchase is completed, but also when users add a product to their carts. In order to deal with that, we created a set of documents having jsonType = cart, containing the ID of the user and and a list of the items added to it. Those were created when a user added a product to the cart. See the example below:
To close the loop, a new “reduce” view was created. The reduce function was just “_sum”. And the map function looked like this:
The aim of this view was to get the amount of items in every single cart document. To achieve this goal, we used the startkey and endkey arguments of a Couchbase view call, like this:
What we are doing here is specifying a startkey and an endkey (couchbase will only return keys between those values), and ask for a level 1 reduction. In plain english: We are asking couchbase to add (_sum) the emitted values (quantities in cart) of the carts containing the product and created between those two dates.
You can do a lot of magic by emitting complex keys in your map functions. Start and end keys will always be sorted alphanumerically according to unicode. Read more about sorting here: http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-writing-querying-ordering.html
So, each time we needed to retrieve the stock of a product, we called the previous view by indicating the ID of the product and two datetimes string: one containing a “fifteen minutes ago” and the other containing a “now”.
Couchbase is perfect for reporting queries but, apparently, it is not so good when trying to emulate the “commit” mechanism of a relational database. There is no such a thing as COMMIT, BEGIN TRANS or END TRANS. If you want to get the current contents of the view, they won’t be available until the next time you call the view. Let’s use an example to explain this.
1.- Call view. View returns 1, 2 and 3
2.- Insert 4
3.- Call view. Still returns 1, 2 and 3.
4.- Call view. Now, it returns 1, 2, 3 and 4
This is caused because every view created a index when it is updated and just after returning the results. Couchbase uses that index to return the stored values very, very fast. After the operation finished, it updated the index again. In the previous example, The index is updated after step 3. Not before step 3 or during the execution of step 3. This is why you get the “wrong” values.
How to deal with this in a almost-real-time environment like the 15-minute reservation explained before? Well... the answer is simple. Just add stale=false to the call, like this:
This will force Couchbase to “update the index” before returning you any results. Then, you’ll get the expected, traditional results in step 3 above. But be aware of using this carefully and only in queries that require it, because it will decrease the performance of your database.
Where is my JOIN?
By choosing the described “join-cluster” approach when defining the structure of your data documents, you’ll stay away from headaches involving joins and couchbase. A newbie would say that it is impossible to “join” in Couchbase, as there are no foreign keys and the map functions cannot access all documents several times.
The first (and naïf) approach would consist of retrieving the designer by ID and then, query Couchbase again to get all the products by their designer. The “products by designer” view would look like this:
(Note: doc.images is an array)
If you need to retrieve all the products of a single designer, you can send the id of the designer like this:
And, if you need to retrieve all the products from several designers, you can go like this:
SQL-like speaking:
- key is equal to “WHERE designerId = ‘some_designer_id’”
- keys (notice the array!) is equal to “WHERE designerId IN some_list_of_designer_ids”
This can be useful in many situations, but it wasn’t exactly what we were trying to achieve. Let’s say that we don’t want to be naïf and that we don’t want to query the database many times by the use of some ignote join-like operator. There is a technique named View Collation that will emulate the inner join operation. Trust me: view collation is your friend. I learned it from here: http://wiki.apache.org/couchdb/View_collation
To keep ourselves focused: the idea is to retrieve the designer and all his/her products by querying the database only once. Then, you’ll need to create a view like this one:
And call it like this:
Because of the UTF-8 sorting implemented in Couchbase, the first element in the result rows will be the designer having the specified key. Then, the following elements will be products designed by that designer. You have your “join”, now. This is a sample result:
On returning all the contents of the docs
In a very early stage of the development, we had views that returned the whole document as the emitted value. The simplest view like this should be:
Soon, we found out that doing that was a mistake because we were using much more bandwidth than we initially expected. Remember: usually, you don’t need all the information on your documents to render your web pages, but a subset of it. I encourage you to emit arrays containing only the desired information. Like this:
And, just in those cases where having all the information is mandatory, you can use a view like this:
And call it passing the “include_docs=true” argument. For instance:
Then you will get something like this:
Notice the new key in each row of the result, named “doc”. It will contain the whole document.
Concerning project management
I am a developer, and a geek, and a motorcycle fan and I like to learn lots of new things. But I am also an experienced Project Manager and CTO. And I am very concerned about learning curves, development times and costs.
Well... I have to say that the development team at Presive consisted in 4 people, including myself. Two of us were involved full-time in the project. The other two people were part-time interns. We all together managed to get Presive.com from zero to up and running in just 3,5 weeks. It took us 6 weeks to customize Satchmo. Keep this in mind.
function (doc) {
if(doc.jsonType == "user") {
emit(doc.email, [doc.user.givenName, doc.user.familyName]);
}
}
Take a look back at the user document example. Specifying the table you want to retrieve data from is equivalent to the if statement above. The result of the query will only contain information for documents having that jsonType. Then, choosing the “columns” you want to retrieve is equivalent to specifying the “value” in the emit function (in this case, an array).
If you create a view like that on your couchbase administration website under “_design/dev_test/_view/test”, you can get the results from it by navigating to:
http://yourdomain.com:8092/default/_design/dev_test/_view/test?connection_timeout=60000&limit=10&skip=0&key=”p.casado.arias@gmail.com”
Finally, the information you’ll get from couchbase will be like this:
{
"total_rows":77830,
"rows": [
{
"id":"44fb07ca-69ee-4fc1-bb44-e0b13f3d08d8",
"key":"p.casado.arias@gmail.com",
"value":["Pablo","Casado"]
}
]
}
Couchbase will always return the document ID and then, the key and the value you emitted. By passing the “key” argument in the previous URL, you are indicating to the database engine to return only information from the document having the matching key. In other words: you’ll only get results if the the first argument in the “emit” call matches the value of the key argument in the URL.
Stock control
The underlying business model of Presive.com required a 15-minute reservation cart for each product a user added to the cart. This means that, if you have added a product to your shopping cart, you have 15 minutes to go through the checkout process (or add another item to the cart) before your reservation expires.
This made the stock control process more complex than usual. It involved not only decreasing the actual stock for each SKU after a purchase is completed, but also when users add a product to their carts. In order to deal with that, we created a set of documents having jsonType = cart, containing the ID of the user and and a list of the items added to it. Those were created when a user added a product to the cart. See the example below:
{
"_id": "011de442eace65a9c0de7ec1a1b86ad1",
"_rev": "1-0010fc329c910b380000014100000000",
"$flags": 0,
"$expiration": 0,
"jsonType": "cart",
"user": "852da85a-a730-4d40-ac67-b3e28059345b",
"created_at": "2012-07-12T09:11:21+00:00",
"updated_at": "2012-07-12T09:11:21+00:00",
"items": {
"bussoga-ceramic-tile-mural": {
"added_at": "2012-07-12T09:11:21+00:00",
"quantity": 1
}
}
}
To close the loop, a new “reduce” view was created. The reduce function was just “_sum”. And the map function looked like this:
function (doc) {
if(doc.jsonType == 'cart') {
for(item in doc.items) {
emit([item, doc.updated_at], doc.items[item].quantity);
}
}
}
The aim of this view was to get the amount of items in every single cart document. To achieve this goal, we used the startkey and endkey arguments of a Couchbase view call, like this:
http://yourdomain.com:8092/default/_design/dev_stock/_view/reservations?startkey=["bussoga-ceramic-tile-mural", "2012-01-01T00:00:00+00"]&endkey=["bussoga-ceramic-tile-mural", "2012-08-01T23:59:59+00"]&group=true&group_level=1
What we are doing here is specifying a startkey and an endkey (couchbase will only return keys between those values), and ask for a level 1 reduction. In plain english: We are asking couchbase to add (_sum) the emitted values (quantities in cart) of the carts containing the product and created between those two dates.
You can do a lot of magic by emitting complex keys in your map functions. Start and end keys will always be sorted alphanumerically according to unicode. Read more about sorting here: http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-writing-querying-ordering.html
So, each time we needed to retrieve the stock of a product, we called the previous view by indicating the ID of the product and two datetimes string: one containing a “fifteen minutes ago” and the other containing a “now”.
Couchbase is perfect for reporting queries but, apparently, it is not so good when trying to emulate the “commit” mechanism of a relational database. There is no such a thing as COMMIT, BEGIN TRANS or END TRANS. If you want to get the current contents of the view, they won’t be available until the next time you call the view. Let’s use an example to explain this.
1.- Call view. View returns 1, 2 and 3
2.- Insert 4
3.- Call view. Still returns 1, 2 and 3.
4.- Call view. Now, it returns 1, 2, 3 and 4
This is caused because every view created a index when it is updated and just after returning the results. Couchbase uses that index to return the stored values very, very fast. After the operation finished, it updated the index again. In the previous example, The index is updated after step 3. Not before step 3 or during the execution of step 3. This is why you get the “wrong” values.
How to deal with this in a almost-real-time environment like the 15-minute reservation explained before? Well... the answer is simple. Just add stale=false to the call, like this:
http://yourdomain.com:8092/default/_design/dev_stock/_view/reservations?startkey=["bussoga-ceramic-tile-mural", "2012-01-01T00:00:00+00"]&endkey=["bussoga-ceramic-tile-mural", "2012-08-01T23:59:59+00"]&group=true&group_level=1&stale=false
This will force Couchbase to “update the index” before returning you any results. Then, you’ll get the expected, traditional results in step 3 above. But be aware of using this carefully and only in queries that require it, because it will decrease the performance of your database.
Where is my JOIN?
By choosing the described “join-cluster” approach when defining the structure of your data documents, you’ll stay away from headaches involving joins and couchbase. A newbie would say that it is impossible to “join” in Couchbase, as there are no foreign keys and the map functions cannot access all documents several times.
The first (and naïf) approach would consist of retrieving the designer by ID and then, query Couchbase again to get all the products by their designer. The “products by designer” view would look like this:
function (doc) {
if(doc.jsonType == 'product') {
emit(doc.designerId, [doc.id, doc.name, doc.images, doc.designer])
}
}
(Note: doc.images is an array)
If you need to retrieve all the products of a single designer, you can send the id of the designer like this:
http://yourdomain.com:8092/default/_design/dev_products/_view/by_designer?key=”bussoga”
And, if you need to retrieve all the products from several designers, you can go like this:
http://yourdomain.com:8092/default/_design/dev_products/_view/by_designer?keys=["atelier-melis","bussoga"]
SQL-like speaking:
- key is equal to “WHERE designerId = ‘some_designer_id’”
- keys (notice the array!) is equal to “WHERE designerId IN some_list_of_designer_ids”
This can be useful in many situations, but it wasn’t exactly what we were trying to achieve. Let’s say that we don’t want to be naïf and that we don’t want to query the database many times by the use of some ignote join-like operator. There is a technique named View Collation that will emulate the inner join operation. Trust me: view collation is your friend. I learned it from here: http://wiki.apache.org/couchdb/View_collation
To keep ourselves focused: the idea is to retrieve the designer and all his/her products by querying the database only once. Then, you’ll need to create a view like this one:
function (doc) {
if(doc.jsonType=="designer"){
emit([doc._id, 0], [doc.title, doc.contact.city]);
}else if(doc.jsonType == "product"){
emit([doc._id, 1], [doc.id, doc.name, doc.images]);
}
}
And call it like this:
http://yourdomain.com:8092/default/_design/dev_test/_view/test_collide?startkey=["bussoga",0]&endkey=["bussoga\ufff0",1]
Because of the UTF-8 sorting implemented in Couchbase, the first element in the result rows will be the designer having the specified key. Then, the following elements will be products designed by that designer. You have your “join”, now. This is a sample result:
{
"total_rows": 30,
"rows": [
{
"id": "bussoga",
"key": ["bussoga", 0],
"value": ["Bussoga", "Girona"]
},
{
"id": "bussoga-ceramic-tile-mural",
"key": ["bussoga-ceramic-tile-mural", 1],
"value": ["bussoga-ceramic-tile-mural", "Ceramic Tile Mural", ["1.JPG", "3.JPG"]]
},
{
"id": "bussoga-fake-product",
"key": ["bussoga-fake-product", 1],
"value": ["bussoga-fake-product", "Bussoga Fake Product", ["SFM_skull3.jpg"]]
}
]
}
On returning all the contents of the docs
In a very early stage of the development, we had views that returned the whole document as the emitted value. The simplest view like this should be:
function (doc) {
emit(doc._id, doc);
}
Soon, we found out that doing that was a mistake because we were using much more bandwidth than we initially expected. Remember: usually, you don’t need all the information on your documents to render your web pages, but a subset of it. I encourage you to emit arrays containing only the desired information. Like this:
function (doc) {
emit(doc._id, [doc.attr1, doc.attr2, doc.attr3, doc.attr4]);
}
And, just in those cases where having all the information is mandatory, you can use a view like this:
function (doc) {
emit(doc._id, null);
}
And call it passing the “include_docs=true” argument. For instance:
http://yourdomain.com:8092/default/_design/dev_all/_view/all_docs?&include_docs=true
Then you will get something like this:
{
"total_rows": 55908,
"rows": [
{
"id": "000bee1b-6453-4505-a0b0-351ecb69c361",
"key": "000bee1b-6453-4505-a0b0-351ecb69c361",
"value": null,
"doc": {
"_id": "000bee1b-6453-4505-a0b0-351ecb69c361",
"_rev": "1-00106aaf7d628d2500000b6a00000000",
"$flags": 0,
"$expiration": 0,
...
},
...
}
]
}
Notice the new key in each row of the result, named “doc”. It will contain the whole document.
Concerning project management
I am a developer, and a geek, and a motorcycle fan and I like to learn lots of new things. But I am also an experienced Project Manager and CTO. And I am very concerned about learning curves, development times and costs.
Well... I have to say that the development team at Presive consisted in 4 people, including myself. Two of us were involved full-time in the project. The other two people were part-time interns. We all together managed to get Presive.com from zero to up and running in just 3,5 weeks. It took us 6 weeks to customize Satchmo. Keep this in mind.