Streams

Processes use streams for all of their I/O operations. A stream is a sequence of bytes. The bytes can represent any kind of data, for example, text, images, video, or audio. In introductory programming courses, streams are associated with files: a program reads a stream of data from, or writes a stream of data to, a file on a storage device. But we will see that streams are much more versatile. We will show how programs can read streams of data from, and write streams of data to, other programs. In other words, we will see that streams can be used to implement inter-process communication.

Here are references to several online book chapters that review using streams, mostly for file I/O.

NOTE: The Java language now has two very different kinds of objects that are called "streams". There are the traditional I/O streams that we introduce in this document. In addition, starting in Java 8, Java defined a Stream interface that is an implementation of the Stream abstract data type, an idea that comes from functional programming languages. The new Stream interface is not for doing I/O. It provides a modern way to process data structures from the Java Collections Framework.

Here are the basic "stream" types in Java. You can see that the java.util.stream.Stream interface is nothing like the java.io.InputStream or java.io.OutputStream classes.

The Stream abstract data type is becoming an important part of modern programming languages. It plays a big part in modern Java.

Standard I/O Streams

When a process is created by the operating system, the process is always supplied with three open streams. These three streams are called the "standard streams". They are

  • standard input (stdin)
  • standard output (stdout)
  • standard error (stderr)

We can visualize a process as an object with three "connections" where data (bytes) can either flow into the process or flow out from the process.

                      process
                +-----------------+
                |                 |
        >------->>stdin    stdout>>-------->
                |                 |
                |          stderr>>-------->
                |                 |
                +-----------------+

A console application will usually have its stdin stream connected to the computer's keyboard and its stdout and stderr streams connected to the console window.

                      process
                +-----------------+
                |                 |
    keyboard --->>stdin    stdout>>------+---> console window
                |                 |      |
                |          stderr>>------+
                |                 |
                +-----------------+

It is important to realize that the above picture is independent of the programming language used to write the program which is running in the process. Every process looks like this. It is up to each programming language to allow programs, written in that language, to make use of this setup provided by the operating system.

Every operating system has its own way of giving a process access to the internal data structures the operating system uses to keep track of what each standard stream is "connected" to.

The Linux operating system gives every process three file descriptors,

    #define  STDIN_FILENO   0
    #define  STDOUT_FILENO  1
    #define  STDERR_FILENO  2

Linux provides the read() and write() system calls to let a process read from and write to these file descriptors.

The Windows operating system gives every process three handles. We retrieve the handles using the GetStdHandle() function with one of these input parameters.

     STD_INPUT_HANDLE, STD_OUTPUT_HANDLE, STD_ERROR_HANDLE

Windows provides the ReadFile() and WriteFile() functions to let a process read from and write to these handles.

Every programming language must have a way of representing the three standard streams and every language must provide a way to read from the standard input stream and a way to write to the standard output and standard error streams.

For example, here is how the three standard I/O streams are represented by some common programming languages.

    Java uses Stream objects.
      java.io.InputStream  System.in
      java.io.PrintStream  System.out
      java.io.PrintStream  System.err
    These are static fields in the java.lang.System class.

    Standard C uses pointers to FILE objects.
      FILE* stdin;
      FILE* stdout;
      FILE* stderr;
    These are defined in the stdio.h header file.

    Python uses text file objects.
      sys.stdin
      sys.stdout
      sys.stderr
    These are in the sys module.

    C++ uses stream objects.
      istream std::cin;
      ostream std::cout;
      ostream std::cerr;
    These are defined in the <iostream> header.

    .NET uses TextReader and TextWriter objects.
      System.IO.TextReader  Console.In
      System.IO.TextWriter  Console.Out
      System.IO.TextWriter  Console.Error
    These are static properties of the System.Console class.

The C language provides functions like getchar(), scanf(), and fscanf() to read from stdin and it provides printf() and fprintf() to write to stdout and stderr. On a Windows computer, the C language's printf() function will be implemented using Windows' WriteFile() function with the STD_OUTPUT_HANDLE handle. On a Linux computer, the C language's printf() function will be implemented using Linux's write() system call with the STDOUT_FILENO file descriptor.
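
For comparison, here is a minimal Java sketch of a program that touches all three standard streams. It copies the bytes from System.in to System.out and reports the byte count on System.err. The class name CopyStdin and the helper method copy() are made up for this illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class CopyStdin {
    // Copy every byte from 'in' to 'out' and return the number of bytes copied.
    static long copy(InputStream in, OutputStream out) throws IOException {
        long count = 0;
        int b;
        while ((b = in.read()) != -1) {   // read() returns -1 at end-of-file
            out.write(b);
            ++count;
        }
        out.flush();
        return count;
    }

    public static void main(String[] args) throws IOException {
        long n = copy(System.in, System.out);          // stdin -> stdout
        System.err.println("copied " + n + " bytes");  // diagnostics -> stderr
    }
}
```

Because the byte count goes to System.err rather than System.out, it stays separate from the copied data.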

I/O Redirection

Every process is created by the operating system at the request of some other process, the parent process. When the parent process asks the operating system to create a child process, the parent must tell the operating system how to "connect" the child's three standard streams. The parent telling the operating system how to connect the child's three standard streams is usually referred to as I/O redirection.

At a shell command prompt, if we type a command like this,

    > foo > result.txt

then the shell program (cmd.exe on Windows, or bash on Linux) is the parent process. The above command tells the shell process to ask the operating system to create a child process from the foo program. But in addition to asking the operating system to create the child process, the shell process also instructs the operating system to redirect the child process's standard output to the file result.txt. So when the foo process runs, it looks like this.

                   foo process
                +-----------------+
                |                 |
    keyboard --->>stdin    stdout>>----> result.txt
                |                 |
                |          stderr>>----> console window
                |                 |
                +-----------------+

Stdin and stderr have their default connections, and stdout is redirected to the file result.txt.

If we type a command like this,

    > foo > result.txt < data.txt

then the shell process will ask the operating system to create a child process from the foo program and also ask the operating system to redirect the child process's standard output to the file result.txt and redirect the child process's standard input to the file data.txt. So when the foo process runs, it looks like this.

                   foo process
                +-----------------+
                |                 |
    data.txt --->>stdin    stdout>>----> result.txt
                |                 |
                |          stderr>>----> console window
                |                 |
                +-----------------+
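
A Java program can play the role of the parent process in the picture above. The following is a minimal sketch that uses the java.lang.ProcessBuilder class to ask for a child process whose stdin and stdout are connected to files while its stderr keeps the parent's connection. The class name RedirectDemo and the method runRedirected() are hypothetical, and the example assumes that a sort program (which reads stdin and writes stdout) is on the PATH.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class RedirectDemo {
    // Run 'command' as a child process with stdin and stdout connected to files.
    static int runRedirected(List<String> command, File input, File output)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectInput(input);                            // like  < data.txt
        pb.redirectOutput(output);                          // like  > result.txt
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);  // stderr stays where the parent's is
        Process child = pb.start();
        return child.waitFor();   // wait for the child and return its exit code
    }

    public static void main(String[] args) throws Exception {
        // Temporary files stand in for data.txt and result.txt.
        Path data = Files.createTempFile("data", ".txt");
        Path result = Files.createTempFile("result", ".txt");
        Files.write(data, List.of("pear", "apple", "fig"));

        // Assumption: a "sort" program is available on the PATH.
        int exit = runRedirected(List.of("sort"), data.toFile(), result.toFile());
        System.out.println("exit code: " + exit);
        System.out.println(Files.readAllLines(result));
    }
}
```

The child program itself needs no changes. It still reads from its stdin and writes to its stdout, and it is the parent that decides what those streams are connected to.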

Shared streams

When two processes share a stream, it is usually the case that one of the two processes is idle while the other process uses the shared stream (the idle process will often be waiting for the other process to terminate). If two processes are simultaneously using a shared stream, the results can be confusing and unpredictable.

If two processes simultaneously use an output stream, then their outputs will be, more or less, randomly intermingled in the stream's final destination. This can lead to unusable results.

If two processes simultaneously use an input stream, as in the following picture, then it is not the case that every input byte flows into each process. Each input byte can only be consumed by one of the two processes. Which process gets a particular byte of input depends on the ordering of when each process calls its read() function on the input stream. This is almost never a desirable situation. Processes almost never simultaneously use a shared input stream. Shared input streams are very common, but the two processes almost always have a way to synchronize their use of the stream so that they are never reading from it simultaneously. The most common way for two processes to share an input stream is for the parent process to wait for the child process to terminate. Then the parent process can resume reading from the input stream.

                       parent
                  +--------------+
                  |              |
           +----->>stdin  stdout>>------->
           |      |              |
           |      |       stderr>>--->
           |      |              |
           |      |              |
           |      +--------------+
     ------+
           |
           |              child
           |         +--------------+
           |         |              |
           +-------->>stdin  stdout>>------>
                     |              |
                     |       stderr>>--->
                     |              |
                     |              |
                     +--------------+
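
The common pattern described above, where the parent shares its streams with a child and then stays idle until the child terminates, can be sketched in Java as follows. The class name ShareAndWait and the method runSharing() are hypothetical. The example assumes that the running JVM has a java launcher in its bin folder, and it uses "java -version" only as a convenient child program.

```java
import java.io.IOException;
import java.nio.file.Paths;
import java.util.List;

public class ShareAndWait {
    // Start 'command' as a child that shares this process's three standard
    // streams, then stay idle until the child terminates.
    static int runSharing(List<String> command)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.inheritIO();            // child uses the parent's stdin, stdout, and stderr
        Process child = pb.start();
        return child.waitFor();    // the parent does no I/O while the child runs
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the JVM running this program has a "java" launcher in its bin folder.
        String java = Paths.get(System.getProperty("java.home"), "bin", "java").toString();
        int exit = runSharing(List.of(java, "-version"));
        // The child has terminated, so the parent can use the shared streams again.
        System.out.println("child exited with code " + exit);
    }
}
```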

Pipes

If we type a command like this,

    > foo < data.txt | bar > result.txt

the shell process will ask the operating system to create two child processes, one from the foo program and the other from the bar program. In addition, the shell process will ask the operating system to create a pipe object and have the stdout of the foo process redirected to the input of the pipe, and have the stdin of the bar process redirected to the output of the pipe. Finally, the shell process will ask the operating system to redirect the bar process's standard output to the file result.txt and redirect the foo process's standard input to the file data.txt. So while this command is executing, it looks like this.

                 foo process                  bar process
              +---------------+            +---------------+
              |               |    pipe    |               |
    data.txt-->>stdin  stdout>>--========-->>stdin  stdout>>------> result.txt
              |               |            |               |
              |        stderr>>---+        |        stderr>>----+-> console window
              |               |   |        |               |    |
              +---------------+   |        +---------------+    |
                                  |                             |
                                  +-----------------------------+

In the above command, the two programs, foo and bar, are running simultaneously (in parallel) with each other. The pipe object acts as a "buffer" between the two processes. Whenever the foo process writes something to its output, that something gets put in the pipe "buffer". Then when the bar process wants to read some input data, it reads whatever is currently in the pipe "buffer".

If the foo process writes data faster than the bar process reads data, then data accumulates in the pipe. A pipe has a limited capacity, so if the pipe fills up, foo "blocks" on its next write until bar has read some data out of the pipe. When foo terminates, it may be that data still remains in the pipe, in which case bar will continue to run until it has emptied the pipe. On the other hand, if the bar process reads data out of the pipe much faster than foo writes data into the pipe, then the bar process will often find the pipe empty when bar wants to read some data. In that case, bar "blocks" and waits until some data shows up in the pipe.

When the foo process writes its last bit of data to the pipe and then foo terminates, the operating system will let the bar process know that it has reached the "end-of-file" after the bar process reads the last bit of data from the pipe.
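
A Java program can act as the shell does for a pipeline command. The following is a minimal sketch that uses ProcessBuilder.startPipeline() (available since Java 9) to start two child processes with the stdout of the first connected to the stdin of the second. The class name PipelineDemo and the method runPipeline() are hypothetical. The example assumes a Unix-like system with sort and uniq on the PATH, standing in for foo and bar.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class PipelineDemo {
    // Roughly equivalent to the command line:  foo < input | bar > output
    static void runPipeline(List<String> foo, List<String> bar, File input, File output)
            throws IOException, InterruptedException {
        ProcessBuilder first = new ProcessBuilder(foo)
                .redirectInput(input)                              // foo's stdin  <- file
                .redirectError(ProcessBuilder.Redirect.INHERIT);
        ProcessBuilder second = new ProcessBuilder(bar)
                .redirectOutput(output)                            // bar's stdout -> file
                .redirectError(ProcessBuilder.Redirect.INHERIT);
        // startPipeline() connects the stdout of each process to the stdin of the next.
        List<Process> processes = ProcessBuilder.startPipeline(List.of(first, second));
        for (Process p : processes) {
            p.waitFor();   // both children run concurrently; wait for each to finish
        }
    }

    public static void main(String[] args) throws Exception {
        // Temporary files stand in for data.txt and result.txt.
        Path data = Files.createTempFile("data", ".txt");
        Path result = Files.createTempFile("result", ".txt");
        Files.write(data, List.of("b", "a", "b"));

        // Assumption: "sort" and "uniq" are available on the PATH.
        runPipeline(List.of("sort"), List.of("uniq"), data.toFile(), result.toFile());
        System.out.println(Files.readAllLines(result));
    }
}
```

Neither child needs to know it is part of a pipeline. Each one just reads its stdin and writes its stdout.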

Here is another way to think about the above pipeline command. The shell process could run the two programs, foo and bar, sequentially, one after the other. In other words, the shell process could interpret this command,

    > foo < data.txt | bar > result.txt

as the following three commands.

    > foo < data.txt > temp
    > bar < temp > result.txt
    > del temp

These three commands would have a picture that looks like this.

                   foo process
                +-----------------+
                |                 |
    data.txt --->>stdin    stdout>>----> temp
                |                 |
                |          stderr>>----> console window
                |                 |
                +-----------------+

                   bar process
                +-----------------+
                |                 |
        temp --->>stdin    stdout>>----> result.txt
                |                 |
                |          stderr>>----> console window
                |                 |
                +-----------------+

First the foo process is executed with its output stored in a temporary file called temp. Then the bar process is run with its input coming from the temp file. Then the temp file gets deleted.

Notice that this sequential interpretation of the pipeline command might be considerably slower than the parallel interpretation. And since the sequential interpretation needs to store all the intermediate data in a temp file, the sequential interpretation may require far more storage space than the parallel interpretation.

Here is a more detailed picture of a Java process, its three standard streams, and their buffers. The "user space" buffers belong to Java classes and are used by Java methods. For example, the Scanner class, and all its methods, have a user space input buffer. The PrintWriter class, and its print(), println(), printf() methods, have a user space output buffer. (Note: C and C++ processes do not have a user space buffer for stderr.)

                                        Java process
                             +---------------------------------------+
                             |                                       |
                kernel space |  user space                user space |        kernel space
                +------+     |  +------+                  +------+   |        +------+
    keyboard -->|      |----->>-|      |->stdin   stdout->|      |-->>---+--->|      |---> console window
                +------+     |  +------+                  +------+   |   |    +------+
                 buffer      |   buffer                    buffer    |   |     buffer
                             |                                       |   |
                             |                            user space |   |
                             |                            +------+   |   |
                             |                    stderr->|      |-->>---+
                             |                            +------+   |
                             |                             buffer    |
                             +---------------------------------------+
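
The effect of a user space output buffer can be observed in isolation. In the following minimal sketch, a ByteArrayOutputStream stands in for the destination on the far side of the process boundary, and a BufferedOutputStream plays the role of the user space buffer. The class name BufferDemo and the method sizesAroundFlush() are made up for this illustration.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class BufferDemo {
    // Write 'text' through a buffered stream and report how many bytes have
    // reached the destination before and after the buffer is flushed.
    static int[] sizesAroundFlush(String text) throws IOException {
        ByteArrayOutputStream destination = new ByteArrayOutputStream();
        BufferedOutputStream buffered = new BufferedOutputStream(destination, 1024);

        buffered.write(text.getBytes());
        int before = destination.size();   // the bytes are still in the buffer
        buffered.flush();                  // push the buffered bytes downstream
        int after = destination.size();
        return new int[] { before, after };
    }

    public static void main(String[] args) throws IOException {
        int[] sizes = sizesAroundFlush("hello");
        // prints: before flush: 0, after flush: 5
        System.out.println("before flush: " + sizes[0] + ", after flush: " + sizes[1]);
    }
}
```

This is why output written by one process may not be visible to another process, or in the console window, until the writer flushes or closes its stream.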

Here is a sketch of two processes connected with a pipe and some of the associated buffers.

                  foo process                                    bar process
               +---------------+      user space              +---------------+
               |               |      +------+                |               |
    data.txt -->>stdin   stdout>>-----|      |--+        +--->>stdin   stdout>>----> result.txt
               |               |      +------+  |        |    |               |
               |               |       buffer   |        |    |               |
               |               |                |        |    |               |
               |        stderr>>--+    +--------+        |    |        stderr>>---+-> console window
               |               |  |    |                 |    |               |   |
               +---------------+  |    |  kernel space   |    +---------------+   |
                                  |    |  +------+       |                        |
                                  |    +--| pipe |--+    |                        |
                                  |       +------+  |    |                        |
                                  |        buffer   |    |                        |
                                  |                 |    |                        |
                                  |       +---------+    |                        |
                                  |       |              |                        |
                                  |       |  user space  |                        |
                                  |       |  +------+    |                        |
                                  |       +--|      |----+                        |
                                  |          +------+                             |
                                  |           buffer                              |
                                  |                                               |
                                  +-----------------------------------------------+

Filters and Pipelines

A filter is a program that reads data from its stdin, does some kind of operation on the data, and then writes that converted data to its stdout.
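
As an illustration of this definition, here is a minimal sketch of a filter written in Java. It reads lines from stdin, converts each line to upper case, and writes the result to stdout. The class name UpperCaseFilter and the method filter() are hypothetical, and this is not necessarily how any particular filter program is written.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.io.Reader;
import java.util.Locale;

public class UpperCaseFilter {
    // Read lines from 'in', convert each to upper case, and write it to 'out'.
    static void filter(Reader in, PrintStream out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {   // null means end-of-file
            out.println(line.toUpperCase(Locale.ROOT));
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        filter(new InputStreamReader(System.in), System.out);  // stdin -> stdout
    }
}
```

Since the program only ever uses stdin and stdout, it works unchanged whether those streams are connected to the keyboard, the console window, a file, or a pipe.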

In the filter_programs folder there are Java programs that can act as filter programs. They are all very short programs that do simple manipulations of the input characters. Look at the source code. Compile and then run them using command-lines like the following.

    > java Reverse < Readme.txt > result.txt
    > java Double < Readme.txt | java Reverse
    > java Double | java ToUpperCase | java Reverse
    > java ShiftN 2 | java ToUpperCase | java Reverse
    > java Twiddle < Readme.txt | java ToUpperCase | java Double | java RemoveVowels > result2.txt

Then run a couple of the programs by themselves, without any I/O redirection or pipes, to see how they manipulate input data (from the keyboard) to produce output data (in the console window).

    > java ToUpperCase
    > java Double
    > java Reverse
    > java MakeOneLine

Notice that you need to tap the Enter key to send input from the keyboard to the program. Sometimes you see immediate output. Sometimes there is no output until the input is terminated (end-of-file). You denote the end of your input to the program by typing Control-z (followed by Enter) on Windows or Control-d on Linux. Do not use Control-c. That terminates the program (instead of terminating just the program's input) and can cause the program's output to be lost.

Command-line Syntax

We have seen that command-lines can be made up of, among other things, program names, command-line arguments, file names, I/O redirection operators, and pipes. In this section we will look at the syntax of building complex command-lines that combine all of these elements along with a few new elements.

CMD syntax.

Bash syntax.