CS 102 (Spring '04)
[Schedule]
[Examples]
[Programs]
[Notes
& Reference] [Syllabus]
[Lab & TA] [Tests]
[Grades]
Program #4: CyberWill
Prof. Reed, CS 102, Spring '03
Due Monday, 4/14 at 5:00 in the afternoon
William (Will) Shakespeare, wordsmith extraordinaire. Your task
is to write a program that can write like Will, and can also determine who wrote
a particular piece of prose. (This program is copied from Joe Zachary's assignment,
I suggest you take a look at it.)
[ernie] ~/programs/cyberwill> cyberwill
Author: Dale Reed
Assignment: #4, CyberWill
TA: Edgar A. Po, Mon 1:00-1:05
Using this program you can either generate random text
based on an input file, or you can determine who wrote
text in an input file.
Please select from the following options:
1) Generate Random text
2) Find out who wrote some text
3) Display possible text files
4) Exit the program
Your choice -> 3
Possible text files are:
OHenry.txt bible.txt homer.txt longfellow.txt wordfind.txt
Shakespeare.txt darwin.txt kipling.txt macbeth.txt
Please select from the following options:
1) Generate Random text
2) Find out who wrote some text
3) Display possible text files
4) Exit the program
Your choice -> 1
Enter input file name, degree, and ouput length: bible.txt 6 200
MaxSize of 500000 was reached.
Number of input characters is: 500000
Generating output with degree 6 and output length 200
ome To The World of Moses spake unto Moses took it
out, and thou badest
me: and shall be unclean until now,
my lord.
44:18 Then his host: and there reigned in ouches of the times
before the bottom of
Please select from the following options:
1) Generate Random text
2) Find out who wrote some text
3) Display possible text files
4) Exit the program
Your choice -> 2
Your input file will be compared against homer, OHenry, kipling,
longfellow, and macbeth.
Enter input file name to be analyzed: tempest.txt
That input is closest to macbeth.
Please select from the following options:
1) Generate Random text
2) Find out who wrote some text
3) Display possible text files
4) Exit the program
Your choice -> 4
Goodbye.
[ernie] ~/programs/cyberwill>
You need to know the following concepts in order to write this program:
Everything from the previous programs; How to use arrays, random number generator,
reading from a file. You will also need to know a bit about dynamic arrays.
Explanation:
- Solve the program a piece at a time. Start with the menu display. Then implement
random text generation. After that is done work on identifying who wrote some
particular text.
- Here are some suggestions for implementing random text generation.
- First get your program to open an input file and echo it to
the screen (or to an output file). I strongly suggest you create a small
file with known contents to use in debugging your program.
- Now dynamically declare an array to use to store the first 500,000 characters
of the input file. I'm limiting it in size, since otherwise, the program
is very s_l_o_w on the UNIX system, though it is passable on a standalone
Windows machine. To declare such an array, you would use:
// Dynamically created array. Creating it "normally"
// on the stack doesn't work because it is too big.
char *text = new char[MaxSize];
...
... // use the array throughout your program
...
delete [] text; // free up the space when done
Note that you would reference this array as you would any other array.
For instance, to refer to the element at position 3, we would use: text[
3]. In the definition of array text above, new is the C++
command to allocate more memory. In this particular example we are allocating
MaxSize number of char. MaxSize is a variable that
would have some number in it.
The name of the array is text. Remember that the name of an array
is the address of the first element. That is why the type of text
is char *.
- I suggest you write a function that receives the filename and large
array, opens the input file, and reads in the characters. Note that you
may have an input file that is larger than your array. You want to stop
once you've reached the array limit.
- Now write a function that generates the output. You should send it the
"window" size (i.e. the degree), the text array, the number
of characters in the text array, and the number of output characters.
This function should first declare (dynamically, as seen above) an array
to represent the "window."
- The "window" should then be initialized to be the characters
found at a random starting position in the file.
- Now step through the file, finding all places where the "window"
matches what is found in the file. For each of those places, keep track
of the character that is found immediately after the "window".
I suggest you count the number of each character found. For instance,
if my degree is 2, and my "window" contains the characters "th",
then I want to count how many times 'a' follows "th", how many
times 'b' follows "th", how many times 'c' follows "th",
and so on. When I'm done with this step I have an array that counts how
many times each character follows those in the "window".
- Now find the sum of the counts of the characters accumulated in the
previous step. Declare a dynamic array of this size, and populate it with
the actual characters. In other words, if my "window" were the
characters "th", I may discover that in the input file an 'a'
follows "th" four times, an 'e' follows "th" 6 times,
and an 'i' follows "th" 2 times. 4+6+2 is 12, so I would dynamically
create an array of size 12. Then put the characters into that array, so
that array might look like:
aaaaeeeeeeii
Then randomly select one of these characters. That is the character to
use for output.
- Each time you display an output character (e.g. cout<<theCharacter;
) you should also execute the statement cout<<flush;
to force the output to appear on the screen (you can combine these
into a single cout statement). Otherwise output is sometimes buffered
until a full line is accumulated, and it seems as if your program is stalled.
- Now "shift left" your window. The left-most character is replaced
by the next character, and so on, and the right-most character in your
window becomes the output character selected in the steps above. Note
that the "window" is initialized with contents from the file,
but after that it is the generated characters that gradually fill in and
modify the contents of the "window". Note that the contents
of the window is never printed out.
- Repeat this process, until you've produced as output the required number
of characters.
- For identifying who wrote some text, we will compare some unknown author
input file (the "mystery" file) to 6 known text files: bible.txt,
macbeth.txt, kipling.txt, homer.txt, OHenry.txt, and longfellow.txt. The program
identifies which file matches most closely.
You will need to compute the letter probabilities in each of the potential
matching files. We could quickly run into space problems here, so we are going
to restrict this by considering only alphabetical characters, and converting
all characters to uppercase. We want to keep track of degree 4, so we will
dynamically declare a 4-dimensional array of integers. I don't want to explain
all the reasoning behind how this works at this point, so simply copy and
paste the code below when it comes time to create this array. Then go ahead
and use it as a 4-dimensional array.
#include // For general IO
#include // For assertions
#include // For character testing functions
#include // For file input and output
#include // for random number generation
#include // needed for fabs function (absolute value of a float)
using namespace std;
const int MaxSize = 500000; // King James Bible is 4,345,144 bytes
const int NumberOfCharacters = 128; // number of characters
const int LineLength = 81; // standard line length
const int AlphabetSize = 26; // number of alphabetic characters
// Allocate memory for character counting array.
// Note that the '&' is needed so that the new array
// address is passed back as a reference parameter.
void allocateCountingArrayMemory( int **** & matrix)
{
// Allocate space for array
matrix = new int***[ AlphabetSize];
for (int i=0; i < AlphabetSize; i++) {
matrix[i] = new int**[ AlphabetSize];
for (int j=0; j < AlphabetSize; j++) {
matrix[i][j] = new int*[ AlphabetSize];
for (int k=0; k < AlphabetSize; k++) {
matrix[i][j][k] = new int[ AlphabetSize];
// Initialize values to 0 here
for (int m=0; m < AlphabetSize; m++) {
matrix[i][j][k][m] = 0;
}
}
}
}
}
// Deallocate memory for the character counting array. I suppose
// the '&' to make matrix a reference parameter is not strictly
// necessary, since the memory is freed up in either case.
void deallocateCountingArrayMemory( int **** & matrix)
{
// Deallocate dynamically allocated space for the array
for (int i=0; i < AlphabetSize; i++) {
for (int j=0; j < AlphabetSize; j++) {
for (int k=0; k < AlphabetSize; k++) {
delete [] matrix[i][j][k];
}
delete [] matrix[i][j];
}
delete [] matrix[i];
}
delete [] matrix; // delete the array at the outermost level
}
int main()
{
// number of input files to compare against
const int numberOfFiles = 6;
// array of input file names
char fileNames[ numberOfFiles][ LineLength] = {
"bible.txt",
"macbeth.txt",
"kipling.txt",
"homer.txt",
"OHenry.txt",
"longfellow.txt"};
// To use a single filename use:
cout << fileNames[ 3]; // displays kipling.txt
// You have to manipulate it a bit to display
// it without the ".txt"
// Now illustrate allocating (and later deallocating)
// the big counting arrays
// First declare the pointers to the array.
int **** knownFileCharCountsArray;
int **** mysteryFileCharCountsArray;
// allocate memory
allocateCountingArrayMemory( knownFileCharCountsArray);
allocateCountingArrayMemory( mysteryFileCharCountsArray);
// note that in following line each subscript must be in range 0..25
// Illustration of use:
knownFileCharCountsArray[1][4][3][2] = 5;
// deallocate memory
deallocateCountingArrayMemory( knownFileCharCountsArray);
deallocateCountingArrayMemory( mysteryFileCharCountsArray);
return 0;
}
Using this array, we will then count how many times each 4-letter sequence
of letters appears. For instance, we count how many times "AAAA"
appears, how many times "AAAB" appears, and so on. We accumulate
these counts for each of the possible matching input files. We then consider
the percentages of each character relative to the total, and compare this
to the same analysis done on the file which is being analyzed (the "mystery"
file). The file that matches most closely is selected as the match.
For instance, let's say that in our mystery file there is a total of 100 characters,
where the 4-letter sequence "THES" appears 5 times, and the 4-letter
sequence "THE " appears 20 times. We then figure out the proportions
of each count relative to the total. "THES" appears 5/100 = 5% of
the time, and "THE " appears 20/100 = 20% of the time.
Then we do the same analysis on the first of the 6 known-author text files.
Let's pretend that in the first one (bible.txt) there are a total of 200 characters,
where "THES" appears 50 times, and "THE " appears 120
times. The proportion in this file for "THES" is 50/200 = 25%, and
for "THE " it is 120/200 = 60%.
We then sum the absolute value of the differences between these percentages.
In the above example we add:
abs(5%-25%)
+ abs(20%-60%) = .20 + .40 = .60
We use this number .60 as the measure of how similar our "mystery"
file is to bible.txt. Of course the actual values will be much different than
this, and there will be 4-letter-sequence counts for many more than just "THES"
and "THE ". I recommend that you store the differences as type double,
which means you will have to use function fabs to find the absolute
value. When using fabs, you will also need #include <cstdlib>
at the top of your program.
- You can run my version of this program at ~i102/programs/cyberwill/cyberwill.
Various text files (*.txt) are also at that location if you would like to
download them from there (text files taken from www.gutenberg.org). Alternatively
you can access them from the following links: bible.txt
(4.24M), darwin.txt (18K), wordfind.txt
(23K), macbeth.txt (118K), kipling.txt
(369K), homer.txt (812K), OHenry.txt
(417K), longfellow.txt (1.86M), will.txt
(4.43M), and tempest.txt (97K).
You can use tempest.txt as your test file for part 2, since it is also
written by Shakespeare and should therefore be most similar to Shakespeare's
macbeth.txt.
- Turnin your program into the program4
project.
[CS Dept] [UIC]
[Prof. Reed]