发布于 2015-09-14 14:54:20 | 155 次阅读 | 评论: 0 | 来源: 网络整理
This document describes the basic patterns and principles for designing an E-Commerce product catalog system using MongoDB as a storage engine.
Product catalogs must have the capacity to store many differed types of objects with different sets of attributes. These kinds of data collections are quite compatible with MongoDB’s data model, but many important considerations and design decisions remain.
For relational databases, there are several solutions that address this problem, each with a different performance profile. This section examines several of these options and then describes the preferred MongoDB solution.
One approach, in a relational model, is to create a table for each product category. Consider the following example SQL statement for creating database tables:
CREATE TABLE `product_audio_album` (
`sku` char(8) NOT NULL,
...
`artist` varchar(255) DEFAULT NULL,
`genre_0` varchar(255) DEFAULT NULL,
`genre_1` varchar(255) DEFAULT NULL,
...,
PRIMARY KEY(`sku`))
...
CREATE TABLE `product_film` (
`sku` char(8) NOT NULL,
...
`title` varchar(255) DEFAULT NULL,
`rating` char(8) DEFAULT NULL,
...,
PRIMARY KEY(`sku`))
...
This approach has limited flexibility for two key reasons:
Another relational data model uses a single table for all product categories and adds new columns anytime you need to store data regarding a new type of product. Consider the following SQL statement:
CREATE TABLE `product` (
`sku` char(8) NOT NULL,
...
`artist` varchar(255) DEFAULT NULL,
`genre_0` varchar(255) DEFAULT NULL,
`genre_1` varchar(255) DEFAULT NULL,
...
`title` varchar(255) DEFAULT NULL,
`rating` char(8) DEFAULT NULL,
...,
PRIMARY KEY(`sku`))
This approach is more flexible than concrete table inheritance: it allows single queries to span different product types, but at the expense of space.
Also in the relational model, you may use a “multiple table inheritance” pattern to represent common attributes in a generic “product” table, with some variations in individual category product tables. Consider the following SQL statement:
CREATE TABLE `product` (
`sku` char(8) NOT NULL,
`title` varchar(255) DEFAULT NULL,
`description` varchar(255) DEFAULT NULL,
`price`, ...
PRIMARY KEY(`sku`))
CREATE TABLE `product_audio_album` (
`sku` char(8) NOT NULL,
...
`artist` varchar(255) DEFAULT NULL,
`genre_0` varchar(255) DEFAULT NULL,
`genre_1` varchar(255) DEFAULT NULL,
...,
PRIMARY KEY(`sku`),
FOREIGN KEY(`sku`) REFERENCES `product`(`sku`))
...
CREATE TABLE `product_film` (
`sku` char(8) NOT NULL,
...
`title` varchar(255) DEFAULT NULL,
`rating` char(8) DEFAULT NULL,
...,
PRIMARY KEY(`sku`),
FOREIGN KEY(`sku`) REFERENCES `product`(`sku`))
...
Multiple table inheritance is more space-efficient than single table inheritance and somewhat more flexible than concrete table inheritance. However, this model does require an expensive JOIN operation to obtain all relevant attributes relevant to a product.
The final substantive pattern from relational modeling is the entity-attribute-value schema where you would create a meta-model for product data. In this approach, you maintain a table with three columns, e.g. entity_id, attribute_id, value, and these triples describe each product.
Consider the description of an audio recording. You may have a series of rows representing the following relationships:
Entity | Attribute | Value |
---|---|---|
sku_00e8da9b | type | Audio Album |
sku_00e8da9b | title | A Love Supreme |
sku_00e8da9b | ... | ... |
sku_00e8da9b | artist | John Coltrane |
sku_00e8da9b | genre | Jazz |
sku_00e8da9b | genre | General |
... | ... | ... |
This schema is totally flexible:
The downside for these models, is that all nontrivial queries require large numbers of JOIN operations that results in large performance penalties.
Additionally some e-commerce solutions with relational database systems avoid choosing one of the data models above, and serialize all of this data into a BLOB column. While simple, the details become difficult to access for search and sort.
Because MongoDB is a non-relational database, the data model for your product catalog can benefit from this additional flexibility. The best models use a single MongoDB collection to store all the product data, which is similar to the single table inheritance relational model. MongoDB’s dynamic schema means that each document need not conform to the same schema. As a result, the document for each product only needs to contain attributes relevant to that product.
At the beginning of the document, the schema must contain general product information, to facilitate searches of the entire catalog. Then, a details sub-document that contains fields that vary between product types. Consider the following example document for an album product.
{
sku: "00e8da9b",
type: "Audio Album",
title: "A Love Supreme",
description: "by John Coltrane",
asin: "B0000A118M",
shipping: {
weight: 6,
dimensions: {
width: 10,
height: 10,
depth: 1
},
},
pricing: {
list: 1200,
retail: 1100,
savings: 100,
pct_savings: 8
},
details: {
title: "A Love Supreme [Original Recording Reissued]",
artist: "John Coltrane",
genre: [ "Jazz", "General" ],
...
tracks: [
"A Love Supreme Part I: Acknowledgement",
"A Love Supreme Part II - Resolution",
"A Love Supreme, Part III: Pursuance",
"A Love Supreme, Part IV-Psalm"
],
},
}
A movie item would have the same fields for general product information, shipping, and pricing, but have different details sub-document. Consider the following:
{
sku: "00e8da9d",
type: "Film",
...,
asin: "B000P0J0AQ",
shipping: { ... },
pricing: { ... },
details: {
title: "The Matrix",
director: [ "Andy Wachowski", "Larry Wachowski" ],
writer: [ "Andy Wachowski", "Larry Wachowski" ],
...,
aspect_ratio: "1.66:1"
},
}
注解
In MongoDB, you can have fields that hold multiple values (i.e. arrays) without any restrictions on the number of fields or values (as with genre_0 and genre_1) and also without the need for a JOIN operation.
For most deployments the primary use of the product catalog is to perform search operations. This section provides an overview of various types of queries that may be useful for supporting an e-commerce site. All examples in this document use the Python programming language and the PyMongo driver for MongoDB, but you can implement this system using any language you choose.
This query returns the documents for the products of a specific genre, sorted in reverse chronological order:
query = db.products.find({'type':'Audio Album',
'details.genre': 'jazz'})
query = query.sort([('details.issue_date', -1)])
To support this query, create a compound index on all the properties used in the filter and in the sort:
db.products.ensure_index([
('type', 1),
('details.genre', 1),
('details.issue_date', -1)])
注解
The final component of the index is the sort field. This allows MongoDB to traverse the index in the sorted order to preclude a slow in-memory sort.
While most searches will be for a particular type of product (e.g album, movie, etc.,) in some situations you may want to return all products in a certain price range, or discount percentage.
To return this data use the pricing information that exists in all products to find the products with the highest percentage discount:
query = db.products.find( { 'pricing.pct_savings': {'$gt': 25 })
query = query.sort([('pricing.pct_savings', -1)])
To support this type of query, you will want to create an index on the pricing.pct_savings field:
db.products.ensure_index('pricing.pct_savings')
Since MongoDB can read indexes in ascending or descending order, the order of the index does not matter.
注解
If you want to preform range queries (e.g. “return all products over $25”) and then sort by another property like pricing.retail, MongoDB cannot use the index as effectively in this situation.
The field that you want to select a range, or perform sort operations, must be the last field in a compound index in order to avoid scanning an entire collection. Using different properties within a single combined range query and sort operation requires some scanning which will limit the speed of your query.
Use the following query to select documents within the details of a specified product type (i.e. Film) of product (a movie) to find products that contain a certain value (i.e. a specific actor in the details.actor field,) with the results sorted by date descending:
query = db.products.find({'type': 'Film',
'details.actor': 'Keanu Reeves'})
query = query.sort([('details.issue_date', -1)])
To support this query, you may want to create the following index.
db.products.ensure_index([
('type', 1),
('details.actor', 1),
('details.issue_date', -1)])
This index begins with the type field and then narrows by the other search field, where the final component of the index is the sort field to maximize index efficiency.
Regardless of database engine, in order to retrieve this information the system will need to scan some number of documents or records to satisfy this query.
MongoDB supports regular expressions within queries. In Python, you can use the “python:re” module to construct the query:
import re
re_hacker = re.compile(r'.*hacker.*', re.IGNORECASE)
query = db.products.find({'type': 'Film', 'title': re_hacker})
query = query.sort([('details.issue_date', -1)])
MongoDB provides a special syntax for regular expression queries without the need for the re module. Consider the following alternative which is equivalent to the above example:
query = db.products.find({
'type': 'Film',
'title': {'$regex': '.*hacker.*', '$options':'i'}})
query = query.sort([('details.issue_date', -1)])
The $options operator specifies a case insensitive match.
The indexing strategy for these kinds of queries is different from previous attempts. Here, create an index on { type: 1, details.issue_date: -1, title: 1 } using the following command at the Python/PyMongo console:
db.products.ensure_index([
('type', 1),
('details.issue_date', -1),
('title', 1)])
This index makes it possible to avoid scanning whole documents by using the index for scanning the title rather than forcing MongoDB to scan whole documents for the title field. Additionally, to support the sort on the details.issue_date field, by placing this field before the title field, ensures that the result set is already ordered before MongoDB filters title field.
Database performance for these kinds of deployments are dependent on indexes. You may use sharding to enhance performance by allowing MongoDB to keep larger portions of those indexes in RAM. In sharded configurations, select a shard key that allows mongos to route queries directly to a single shard or small group of shards.
Since most of the queries in this system include the type field, include this in the shard key. Beyond this, the remainder of the shard key is difficult to predict without information about your database’s actual activity and distribution. Consider that:
In the following example, assume that the details.genre field is the second-most queried field after type. Enable sharding using the following shardCollection operation at the Python/PyMongo console:
>>> db.command('shardCollection', 'product', {
... key : { 'type': 1, 'details.genre' : 1, 'sku':1 } })
{ "collectionsharded" : "details.genre", "ok" : 1 }
注解
Even if you choose a “poor” shard key that requires mongos to broadcast all to all shards, you will still see some benefits from sharding, because:
While sharding is the best way to scale operations, some data sets make it impossible to partition data so that mongos can route queries to specific shards. In these situations mongos sends the query to all shards and then combines the results before returning to the client.
In these situations, you can add additional read performance by allowing mongos to read from the secondary instances in a replica set by configuring read preference in your client. Read preference is configurable on a per-connection or per-operation basis. In PyMongo, set the read_preference argument.
The SECONDARY property in the following example, permits reads from a secondary (as well as a primary) for the entire connection .
conn = pymongo.Connection(read_preference=pymongo.SECONDARY)
Conversely, the SECONDARY_ONLY read preference means that the client will only send read operation only to the secondary member
conn = pymongo.Connection(read_preference=pymongo.SECONDARY_ONLY)
You can also specify read_preference for specific queries, as follows:
results = db.product.find(..., read_preference=pymongo.SECONDARY)
or
results = db.product.find(..., read_preference=pymongo.SECONDARY_ONLY)
也可以参考