Department of Electrical and Computer
Engineering
University of Central Florida
Orlando, FL 32816
USA
Email: {lch, shu, wu}@ece2.engr.ucf.edu
In this paper, we present the performance result and analysis of parallel CBR video servers with conflict-free scheduling algorithms and server synchronization at time slot and time cycle levels. The experimental outcome shows that by relaxing server-level synchronization to a time cycle, the conflict-free slot scheduling produces higher stream throughput than the cycle scheduling.
Parallel video server, CBR, Synchronization.
The motivation of design of parallel video servers to accommodate large number of requests is well established. To balance the workload, the data blocks are normally wide striped across nodes in a round-robin fashion. Without proper server-level synchronization, not only the arrival of data blocks may become out of order, more importantly, the workload between nodes eventually become unbalanced. The resultant system performance jeopardizes the QoS of video-on-demand system.
In order to synchronize the order of data transmission among clustered nodes in a server, server-level synchronization is applied. In MARS [1], tightly-coupled synchronization between server nodes is performed by customed APICs. In Tiger [2], a separate network connecting all nodes is used to facilitate the synchronization task. Although there is no synchronization problem in video servers with a client-pull model where a client explicitly sends a request to a particular node of the server for a specific video data block [3], the unbalanced workload between nodes may result in poor system throughput.
In this paper we study the necessity of server-level synchronization with different scheduling algorithms at time slot and time cycle levels on our parallel CBR video server [4], where video streams are scheduled in Conflict-Free Scheduling (CFS) to optimize the system performance by eliminating resource conflict. Different from other works, the centralized server-level synchronization mechanism is done by multiplexing data and control packets on switched Fast Ethernet.
Our parallel video server, Odyssey, is a cluster of PCs connected by a fast Ethernet switch. There are three kinds of nodes in Odyssey, storage nodes, delivery nodes and a control node. The storage node is responsible for storing video clips, retrieving requested data blocks and sending them to delivery nodes within a time limit. Each storage node deals with its own disk scheduling algorithm separately to provide enough bandwidth such that the system bottleneck is assumed mainly on network transmission bandwidth. In order to decouple system modules of disk access and network transmission, a ping-pong buffer is applied. In addition, partitioned video blocks are wide striped among storage nodes in a round-robin fashion to balance the workload. The delivery node, on the other hand, is responsible for taking requests from clients then forwarding to the control node for scheduling. Video blocks sent by storage nodes are buffered in delivery nodes, where video blocks are re-sequenced if necessary, and then sent to clients. Admission control, network scheduling for retrieving video data from storage nodes to delivery nodes, network synchronizer for synchronizing data retrieval between nodes, and content manager for maintaining video clips are resided in the control node. Upon receiving requests from delivery nodes, the control node schedules requests to the next time cycle if admitted. Unless receiving further notices from clients, the scheduler keeps updating the current schedule table and sends it to storage nodes for retrieving video blocks until reaching their ends. It is a server-push model. Note that the flat architecture where a logical storage node and a logical delivery node are mapped to a single physical node, called a processing node, is applied in Odyssey.
Although data blocks are wide striped to storage nodes in a round-robin fashion, without properly scheduling of data retrieval, resources conflict such as port contention where two storage nodes are transmitting to a single delivery node or disk contention where more than one request retrieves blocks from the same storage node at the same time may occur. Such a resource conflict will lead to a poor stream throughput. In [4], we proposed a conflict-free stream scheduling algorithm to balance the workload between nodes and eliminate contentions so that high system performance and stream throughput are achieved. This scheduling algorithm also guarantees that once the first cycle has a conflict-free scheduling, the following cycles will not have conflict.
The granularity of scheduling video streams can be categorized as two types, cycle scheduling and slot scheduling if a time cycle is further partitioned into a fixed number of time slots. These two scheduling schemes are discussed as follows.
Slot Scheduling. Assume that there are N processing nodes. Given a selected block size s and the client decoding rate R, time is divided into time cycles, where the length of a time cycle is tf = s/R. The time to access N blocks, one from each storage node is called a time period. In general, the data transfer rate of the network can be much higher than the decoding rate R. Therefore, in a time cycle multiple video streams can be serviced by a storage node while the individual stream rate is still preserved. The time cycle is thus divided into time slots. Assume that the length of a time slots, ts, represents the time required for transmitting a video block from a storage node to the delivery node. The number of time slots in a time cycle, m, then is determined by m = floor(tf/ ts). Figure 1 shows an example of a complete video schedule in a time period. To ensure conflict-free transmission, one server-level synchronization is needed at the end of each time slot. Notice that the shaded requests require video data to be retrieved and transmitted to the same processing node in flat architecture, i.e., the network bandwidth to this node is free at this time slot. In this case, requested video data are transferred through shared memory. This surplus network bandwidth can be used for system maintenance such as adding or deleting video files; or increasing the stream throughput. It is easy to show that the aggregated surplus network bandwidth is always 1/N of aggregated network bandwidth in a time period for each processing node.

Figure 1. A complete schedule for slot scheduling in a time
period
Cycle Scheduling. During a time cycle, instead of multiplexing data transmission among N delivery nodes in slot scheduling, with cycle scheduling a storage node only transmits data to one delivery node. Note that there is only one server-level synchronization needed at the end of a time cycle. Furthermore, a starting sequence which designates the transmission order between storage nodes and delivery nodes is assigned to the schedule table at the first time cycle. An example of a complete video schedule for cycle scheduling with a starting sequence of {3,0,2,1} is shown in Figure 2. The motivation of cycle scheduling is to reduce the overhead of synchronization in a time cycle and the complexity of conflict-free scheduling and admission control. The major drawback is that it suffers from the longer average initial delay and the inflexibility of scheduling interactive operations such as fast-forward.

Figure 2. A complete schedule for cycle scheduling in a time
period
There are two kinds of server-level synchronizations, cycle synchronization and slot synchronization. Cycle synchronization is to ensure the order of data arrival in clients is correct while slot synchronization is to eliminate port contentions. In Odyssey, we apply a simple centralized synchronization mechanism to facilitate the task. When a storage node finishes transmitting the video data in a time slot, a synchronization signal SYN is sent to syncer which is essentially acting as a barrier resided in the control node. Once the syncer received SYNs from storage nodes, it broadcasts a START signal to storage nodes for transmitting data of the next time slot.
Apparently, each synchronization adds a certain amount of system overhead to the server. Since in Odyssey the length of transmitted data is constant in each time slot and port contention is already eliminated by CFS, we may relax time slot synchronization to time cycle level. In this case, a small duration of port contention at each time slot may occur. Since most of modern switches have a built-in buffer space, this kind of conflict can be accommodated.
Three kinds of synchronization schemes considered in this paper are described as follows.
Assume that a schedule table is represented as an N(m matrix. To generate a schedule table, one of the N! permutations is filled in each column. Among those sequences, there are two special sequences called Ordered (ORD) and Full-Duplex (FD) sequences. The ORD sequence guarantees that at any time all nodes either transmit data to itself or to each other. The FD sequence guarantees that at any time a node either transmits data to itself or pairs with another node transmitting data to each other. ORD sequence guarantees to generate surplus network bandwidth for all nodes at the same time while only one node has surplus network bandwidth at any time if N is odd with FD sequence [5]. When N=3, which is the test system configuration used in this paper, the N! permutations are happened to split into a set of ORD sequence and a set of FD sequence.
In order to measure the overhead and justify the necessity of server-level synchronization, three synchronization schemes described in the previous section with schedule tables of ORD and FD sequences are tested. To generate a FD schedule table for slot scheduling, first, a full-duplex sequence is assigned to the first time slot of the schedule table. Then, the rest of time slots in the table are filled in a round-robin fashion according to the sequence of first column such that the surplus network bandwidths are uniformly distributed among nodes. A ORD schedule table can be generated in a similar manner. Figure 3 shows the generated ORD and FD schedule tables. With cycle scheduling, each column is filled with the same ORD and FD sequences to generate ORD and FD schedule tables, respectively. The tested ORD and FD schedule tables for cycle scheduling are shown in Figure 4.

Figure 3. Schedule tables for slot scheduling

Figure 4. Scheduling tables for cycle scheduling
In addition, two baseline video servers are implemented for comparison. The first one is called total conflict where each storage node always transmits data to the same delivery node within a time slot. This kind of schedule table produces the maximum port contention on switches. The second one is a client-pull video server where no scheduling algorithm is applied.
The experiment is conducted on our prototype of Parallel CBR video server, Odyssey. It is a flat, cluster architecture consisting of three Pentium II 333 PCs, running the RedHat 5.1 Linux operating system. Each node is equipped with 64MB RAM and three UW SCSI disks. These three nodes are connected by a CISCO CATAYSTA 2916M Fast Ethernet Switch. Video clips are partitioned as 176KBytes video blocks and wide striped into three processing nodes in a round-robin fashion. Here, the length of a video block represents one second of playback time and the video length tested is 300 seconds.
The procedure to measure the stream throughput of various synchronization schemes is described as follows. At the starting time a pre-specified schedule table is sent to storage nodes. Once storage nodes receive the schedule table, it starts to transmit data blocks to delivery nodes. To save network bandwidth, the schedule table is updated locally at the end of each time cycle. Each client records the block arriving time after receiving the first block. A block is considered as missed if the arriving time does not meet the deadline of that block. The stream throughput, in terms of maximum m, is measured as no more 5% blocks transmitted by all nodes are missed during the entire rounds for servicing these N×m clients. Table 1 shows the stream throughput regarding to various scheduling and synchronization schemes.
| Scheduling | cycle | slot | total conflict | client pull | ||||
| Synchronization | cycle | slot | cycle | slot | n/a | |||
| Access pattern | ORD | FD | ORG | FD | ORD | FD | n/a | n/a |
| Maximum m | 60 | 47 | 54 | 45 | 64 | 51 | 21 | 36 |
For cycle scheduling with cycle synchronization, the stream throughputs of a server with ORD and FD schedule tables are 60 and 47, respectively. The increased stream throughput is mainly because with ORD schedule table, surplus network bandwidth that occurs at the same time cycle has been used for increasing stream throughput. With FD schedule table, there is exactly one node producing surplus network bandwidth at each time cycle. However, since the server synchronization is performed at the end of each time cycle, this idle node without any network transmission has to wait until the other two nodes finish their network transmissions. Although with FD schedule table the server has the same amount of surplus network bandwidth as ORD schedule table, since it does not happen at the same time cycle, it cannot be used for increase of stream throughput.
The same argument is also true in slot scheduling with slot synchronization. With ORD schedule table all nodes produce surplus network bandwidth at the same time slot while with FD schedule table exactly only one node produces surplus network bandwidth. The experimental result shows that ORD and FD schedule tables have stream throughput of 54 and 45, respectively. Again, the server synchronization at each time slot prevents a server with FD schedule table from utilizing surplus network bandwidth for increasing stream throughput.
Furthermore, regarding to slot scheduling with cycle synchronization, servers with ORD and FD schedule tables have stream throughput of 64 and 51, respectively. Clearly, a server with cycle synchronization has higher stream throughput than that with slot synchronization. The improvement of throughput is mainly due to the elimination of the server synchronization overhead in a time cycle. This leads us a conclusion that with conflict-free scheduling it is not necessary to synchronize network transmission at each time slot. Although there may have a small duration of port contention due to the clock drifting caused by the synchronization overhead, most of modern switches have enough buffer space to accommodate some conflict.
It is interesting to see why ORD schedule table in slot scheduling with cycle synchronization has higher throughput than that in cycle scheduling. Recall that in slot scheduling with ORD, for a storage node data blocks are uniformly transmitted among three delivery nodes while in cycle scheduling data blocks are transmitted only to one delivery node. Since each connection between a storage node and a delivery node contributes a fixed amount of buffer for data transmission, the total buffer space used for data transmission in a time cycle for slot scheduling with ORD is three times of that for cycle scheduling with ORD. The additional buffer space slightly increases the throughput of data transmission.
Without proper scheduling, the stream throughput of video servers with total conflict schedule table and the client-pull video server are only 21 and 36, respectively. It is observed that the degradation of stream throughput of the client-pull server is mainly due to the unbalanced workload of storage nodes.
In this paper, performance study of three synchronization schemes for conflict-free scheduling which are cycle scheduling with cycle synchronization, slot scheduling with slot synchronization and slot scheduling with cycle synchronization is explored. The experimental outcome not only justifies the necessity of conflict-free scheduling but also shows that by relaxing server-level synchronization to a time cycle, slot scheduling produces better stream throughput than cycle scheduling.