Generate a “graphical” Zipf distribution for the entire text of Moby Dick
Line 12 strips the HTML tags from a copy of the text found on the web. Starting at line 14, the
program splits the entire text into words and continues with a LINQ grouping expression that
tallies each distinct word, ignoring case, into an anonymous type. After deriving a scaling factor
for the graph on line 20, line 22 prints an ASCII histogram bar for each of the 35 most frequent words.
moby_dick.html
04 | using System.Text.RegularExpressions;
10 | String text = new StreamReader("moby_dick.html").ReadToEnd();
12 | text = Regex.Replace(text, "<(.|\n)*?>", String.Empty);
15 |     .Split(" \n\",.;-!?".ToCharArray(), StringSplitOptions.RemoveEmptyEntries)
16 |     .GroupBy(w => w.ToLower())
17 |     .Select(g => new { g.Key, Tally = g.Count() })
18 |     .OrderByDescending(e => e.Tally);
20 | int scale = tallies.First().Tally / 60;
21 | foreach (var tally in tallies.Take(35))
22 |     Console.WriteLine("{0,6} {1}", tally.Key, new String('*', tally.Tally / scale));
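The listing shows only selected lines of the program. A minimal self-contained sketch of the same pipeline might look like the following. The class name `ZipfSketch`, the helper `Histogram`, its `top` and `width` parameters, the inline sample text, and the `Math.Max` guard against a zero scale are all assumptions added for illustration; they are not part of the original program.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class ZipfSketch
{
    // Hypothetical helper: returns one histogram line per word, most frequent first.
    public static List<string> Histogram(string html, int top, int width)
    {
        // Strip HTML tags, as on line 12 of the listing.
        string text = Regex.Replace(html, "<(.|\n)*?>", String.Empty);

        // Split into words and tally them case-insensitively.
        var tallies = text
            .Split(" \n\",.;-!?".ToCharArray(), StringSplitOptions.RemoveEmptyEntries)
            .GroupBy(w => w.ToLower())
            .Select(g => new { g.Key, Tally = g.Count() })
            .OrderByDescending(e => e.Tally)
            .ToList();

        if (tallies.Count == 0) return new List<string>();

        // Assumption: clamp to 1 so that short inputs do not divide by zero.
        int scale = Math.Max(1, tallies[0].Tally / width);

        return tallies
            .Take(top)
            .Select(t => String.Format("{0,6} {1}", t.Key, new String('*', t.Tally / scale)))
            .ToList();
    }

    static void Main()
    {
        // Inline sample instead of moby_dick.html so the sketch runs on its own.
        string sample = "<p>The whale, the sea and the ship.</p> <b>The sea!</b>";
        foreach (string line in Histogram(sample, 35, 60))
            Console.WriteLine(line);
    }
}
```

Without the clamp, the integer division on line 20 yields a scale of zero whenever the most frequent word occurs fewer than 60 times, and line 22 then throws a `DivideByZeroException`. That does not arise for a text as long as Moby Dick, but it does for small test inputs.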
Output:
   the ************************************************************
    of ***************************
   and **************************
    to *******************
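Zipf's law says that a word's frequency is roughly inversely proportional to its rank, so in an ideal Zipf curve the bar at rank r is about 1/r as long as the top bar. One way to compare a histogram against that ideal is sketched below. `ZipfCheck`, `Expected`, and `BarLength` are hypothetical names, and the three sample bars in `Main` are made up for illustration rather than taken from the Moby Dick output.

```csharp
using System;
using System.Linq;

static class ZipfCheck
{
    // Hypothetical helper: bar length an ideal Zipf curve (frequency ~ 1/rank)
    // would give at a 1-based rank, given the length of the rank-1 bar.
    public static int Expected(int topBar, int rank)
    {
        return topBar / rank;
    }

    // Counts the '*' characters in one histogram line.
    public static int BarLength(string line)
    {
        return line.Count(c => c == '*');
    }

    static void Main()
    {
        // Made-up sample bars, not taken from the Moby Dick output.
        string[] lines = { "     a ************", "     b *******", "     c ****" };
        int top = BarLength(lines[0]);
        for (int i = 0; i < lines.Length; i++)
            Console.WriteLine("{0}  observed={1} zipf={2}",
                lines[i], BarLength(lines[i]), Expected(top, i + 1));
    }
}
```

Feeding the program's real output lines into `BarLength` in place of the sample array would show how closely the observed bars track the 1/rank prediction.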