NATIONAL SCIENCE FOUNDATION
TOKYO REGIONAL OFFICE


The National Science Foundation's (NSF) Tokyo Office periodically receives and disseminates reports on research developments in Japan that are related to the Foundation's mission. NSF-sponsored researchers currently working in Japan prepare many of these reports. These reports provide information for use by the global science and engineering community.


Special Scientific Report #98-15 (October 5, 1998)



Study on Distributed File Systems



Mr. Wai Gen Yee, a Ph.D. student in the College of Computing at the Georgia Institute of Technology in Atlanta, Georgia, prepared the following report. Mr. Yee was a participant in the 1998 Summer Institute sponsored by NSF/NIH/USDA and the Science and Technology Agency of Japan. Mr. Hiroshi Inamura, Research Engineer at the Nippon Telegraph and Telephone (NTT) Information & Communication Systems Laboratories in Yokosuka, hosted Mr. Yee. Mr. Yee can be reached via email at: waigen@cc.gatech.edu


I studied how file systems are benchmarked, examined the strengths and weaknesses of current benchmarks, and suggested a way to improve them. I worked at the Nippon Telegraph and Telephone (NTT) systems labs in Yokosuka, Japan. The studies were conducted on two Pentium boxes running FreeBSD UNIX over the local office LAN.

I discussed current file systems and their benchmarking with my host, Mr. Hiroshi Inamura. Some interesting file systems include NFS by Sun Microsystems, and AFS and Coda, both by Carnegie Mellon University. It was important to discuss these file systems because the differences among them give clues to their relative performance in the benchmarks. Studying the file systems also sheds light on the motivation behind benchmark design.

Next, we discussed benchmarking methods. There seem to be two general approaches: synthetic and trace-based. Synthetic benchmarks by themselves do not give information on real-world performance. They only report how quickly basic operations, such as open, close, read and write, perform. Traces, on the other hand, are replays of real activity on a file system. However, replaying the activity of one user-group lacks generality, since that user-group may perform file system operations that another does not, so the results of a trace may be meaningful for one user-group but not for another.
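To make the synthetic approach concrete, here is a minimal sketch in Python of timing the four basic operations named above. It is an illustration only, not the benchmark used at NTT. The function name time_basic_ops, the 4096-byte payload and the repeat count are assumptions made for the example, and the timings reflect the local operating system's caching rather than a distributed file system.

```python
import os
import tempfile
import time


def time_basic_ops(size=4096, repeats=100):
    """Return the mean seconds per call for open, write, read and close.

    Hypothetical helper for illustration: it exercises a scratch file in a
    temporary directory, so results mostly measure the local page cache.
    """
    totals = {"open": 0.0, "write": 0.0, "read": 0.0, "close": 0.0}
    payload = b"x" * size
    with tempfile.TemporaryDirectory() as scratch:
        path = os.path.join(scratch, "bench.dat")
        for _ in range(repeats):
            t0 = time.perf_counter()
            fd = os.open(path, os.O_RDWR | os.O_CREAT)
            t1 = time.perf_counter()
            os.write(fd, payload)
            t2 = time.perf_counter()
            os.lseek(fd, 0, os.SEEK_SET)  # rewind; not counted in any timing
            t3 = time.perf_counter()
            os.read(fd, size)
            t4 = time.perf_counter()
            os.close(fd)
            t5 = time.perf_counter()
            totals["open"] += t1 - t0
            totals["write"] += t2 - t1
            totals["read"] += t4 - t3
            totals["close"] += t5 - t4
    return {op: total / repeats for op, total in totals.items()}


timings = time_basic_ops()
for op, seconds in timings.items():
    print(f"{op}: {seconds * 1e6:.1f} microseconds per call")
```

The output is one number per operation, which is exactly the limitation described above: nothing in it says how often a real user-group performs each operation.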

My solution was to generate a hybrid benchmark, based on the Andrew Benchmark by CMU. A hybrid benchmark combines trace and synthetic benchmarks. A trace is collected and its contents are analyzed. The analysis yields a weight for each component of the synthetic benchmark. The weighted components are summed, yielding a score that is meaningful to a particular file-system user-group.
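The weighting step can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark as implemented: the weights are taken to be each operation's share of the trace, the function names trace_weights and hybrid_score are hypothetical, and the trace and the per-operation timings are made-up sample values.

```python
from collections import Counter


def trace_weights(trace):
    """Return each operation's share of the events in the trace."""
    counts = Counter(trace)
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()}


def hybrid_score(weights, synthetic_times):
    """Sum the synthetic timings, each scaled by its trace-derived weight."""
    return sum(w * synthetic_times[op] for op, w in weights.items())


# Made-up trace for one user-group: 2 opens, 5 reads, 1 write, 2 closes.
trace = ["open", "read", "read", "read", "write",
         "close", "open", "read", "close", "read"]

# Made-up synthetic results, in milliseconds per operation.
synthetic_times = {"open": 2.0, "close": 1.0, "read": 0.5, "write": 4.0}

weights = trace_weights(trace)
score = hybrid_score(weights, synthetic_times)
print(weights)
print(f"hybrid score: {score:.2f} ms per operation")
```

A read-heavy user-group and a write-heavy one would produce different weights from their traces, and therefore different scores for the same file system, which is the point of the hybrid design.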

As an aside, I did some research on database systems. My Ph.D. work involves something called data aggregation. Some servers in client-server databases update clients by preparing a unique update file for each of them. The complexity (amount of work) of the server, in some cases, is proportional to the number of files generated. The solution that we have developed at Georgia Tech involves aggregating multiple update files into single larger update files that multiple clients share. In this case, the server has to generate fewer files, and, thus, do less work.
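The effect on the number of files can be illustrated with a small sketch. It is not the Georgia Tech method: the grouping rule here (fixed-size groups of clients in sorted order) is an assumption chosen only to keep the example short, and the names aggregated_files, needs and group_size are hypothetical.

```python
def aggregated_files(needs, group_size):
    """Merge the update sets of each group of clients into one shared file.

    needs maps a client name to the set of update items that client requires.
    Returns a list of (clients_in_group, merged_items) pairs.
    """
    clients = sorted(needs)
    files = []
    for start in range(0, len(clients), group_size):
        group = clients[start:start + group_size]
        merged = set().union(*(needs[c] for c in group))
        files.append((group, merged))
    return files


# Made-up example: four clients, each needing a few numbered update items.
needs = {"a": {1, 2}, "b": {2, 3}, "c": {4}, "d": {4, 5}}

unaggregated = len(needs)                       # one file per client
shared = aggregated_files(needs, group_size=2)  # one file per group

print(f"files without aggregation: {unaggregated}")
print(f"files with aggregation:    {len(shared)}")
for group, items in shared:
    print(group, sorted(items))
```

The server generates fewer files, at the cost that a client may receive items it did not ask for (client "a" receives item 3 in this example).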

Occasionally, I discussed data aggregation ideas with operating systems researchers, visited database labs around Japan, and gave seminars. Interestingly, talking with the systems researchers was in a way more valuable, because they asked different questions than database researchers would. These questions, and my responses to them, are included in my final report to NTT.

Next steps:

There are three ways in which I will bring my work in Japan back to Georgia Tech. First, having dealt with file system benchmarking, I better understand its pros, cons, applicability and design rationale. One of the problems I have been having with my work at Tech is designing reasonable experiments that would convince people of the advantages and disadvantages of data aggregation. As I mentioned above, I worked on designing a hybrid benchmark for file systems this past summer, and I will try to apply the ideas behind it to a benchmark for databases. Some initial steps involve studying preexisting database benchmarks for more inspiration, determining distinct database function classes (such as add, update, delete, or join operations), and, finally, determining how each of these functions maps to real-world database operations.

The second benefit from this summer is an improved sense of my own work, resulting from the fresh criticisms I have received. To be sure, good science requires the researcher to pit his work against any criticism. Most of the database-style criticisms have already been posed by my own advisor, but by talking with people outside of my field, and outside of academia, I have broadened the scope of my work. Indeed, my Japanese colleagues have given me new ideas and have pointed out significant limitations in my work. The papers that come out of my work will benefit greatly in breadth, depth and conviction from this experience. I am still sifting through what I have learned and attempting to record everything in an organized manner.

Finally, by talking with other students while visiting universities, I gleaned some other ways of tackling my problem based on their work. One student in particular at Kyoto University mentioned his work in "query containment," which is basically the study of functions performed on a database in an attempt to determine whether one function might include the functionality of another. If one query contains another, it seems that the result of the containing query would be a superset of the result of the contained query. I may be able to apply this to the idea of data aggregation. Data in a database server can be grouped based on certain predicates. If I can determine how one of these predicates may contain others, then I may have a tool that can help automate the data aggregation process, which is one of my research goals. I plan on communicating with this student this year.
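A toy case shows the superset relationship. The sketch below assumes the simplest possible predicates, inclusive ranges over a single integer attribute, where containment reduces to comparing endpoints. General query containment is much harder than this, and the names Range, contains and select are hypothetical.

```python
from typing import NamedTuple


class Range(NamedTuple):
    """Predicate 'lo <= value <= hi' on a single integer attribute."""
    lo: int
    hi: int


def contains(outer, inner):
    """True if every value that satisfies inner also satisfies outer."""
    return outer.lo <= inner.lo and inner.hi <= outer.hi


def select(rows, pred):
    """Return the rows that satisfy the range predicate."""
    return {r for r in rows if pred.lo <= r <= pred.hi}


rows = set(range(20))   # made-up table with one integer column
wide = Range(0, 10)
narrow = Range(3, 7)

if contains(wide, narrow):
    # The containing predicate's result is a superset of the contained one,
    # so one update file built for 'wide' could also serve 'narrow' clients.
    print(select(rows, narrow) <= select(rows, wide))
```

Under this assumption, a client group whose predicate is contained in another group's predicate could share that group's larger update file, which is the kind of automated grouping decision described above.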

In the near future, I would like to return to Japan to discuss the fruits of my visit with my Japanese colleagues.

