Basic Stylometry Beta (early access)

Post by **Peter Kirby** » Mon May 25, 2015 6:14 pm

An "early access" beta of a stylometry program is now available here:

http://peterkirby.com/basic-stylometry.html

The program is very raw and may still have bugs, so you will need some patience if you want to get results with it. Here is some documentation.

Please report any bugs that you encounter, along with the error message and (if feasible) all the data you entered.

Entering Data

(1) The 'Words'

This program is exclusively based on the frequency of single words. (I might use other metrics in future efforts.)

You can use multiple tokens per line, in order to match any of the tokens. You can use this, for example, to list all the forms of a word making up a single lemma. You can also use this to combine multiple words (e.g., multiple prepositions, multiple conjunctions) that do not show up frequently enough on their own but do show up in significant quantities together.
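To make the matching rule concrete, here is a rough Python sketch (it is not the program's own code, and the names `parse_word_list` and `count_groups` are made up for illustration). It assumes each non-blank line of the 'Words' box is one group, and that a token in the text counts once toward a group if it equals any form in that group.

```python
from collections import Counter

def parse_word_list(text):
    """Turn the 'Words' box into a list of token groups, one per non-blank line."""
    return [line.split() for line in text.splitlines() if line.strip()]

def count_groups(tokens, groups):
    """For each group, count the tokens in the text that match any form in the group."""
    freq = Counter(tokens)
    # set() keeps a form that is listed twice in one group from being counted twice
    # (an assumption; the real program may handle repeats differently).
    return [sum(freq[form] for form in set(group)) for group in groups]

words_box = "kai\nde d\nalla all\n"
tokens = "o de anqrwpos kai h gunh all ou d".split()
groups = parse_word_list(words_box)
print(count_groups(tokens, groups))  # [1, 2, 1]
```

Note that a form appearing in two different groups would be counted toward both under this sketch.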

Initial testing has shown that this list of twenty items has significant discriminatory power for samples that are approximately 750+ words (preferably 2000+ words) in length. You can add to, subtract from, or modify the list as you like.

o oi h ai to ta
tou twn ths
tw| tois th| tais
ton tous thn tas
kai
te
de d
men
alla all
gar
eis
en
dia di
ek ec
kata kat kaq
pros

autos autou autw| auton autoi autwn autois autous auth auths auth| authn autai autwn autais autas auto auta

outos toutou toutw| touton outoi toutwn toutois toutous auth tauths tauth| tauthn autai tautais tautas touto touto tauta

tis tinos tini tina tines tinwn tisi tisin tinas ti tina

eimi ei esti estin esmen este eisi eisin hn hsqa hn hmen hte hsan esomai esh| esei estai esomeqa esesqe esontai w h|s h| wmen hte wsi eihn eihs eih eihmen eimen eihte eite eihsan eien esoimhn esoio esoito esoimeqa esoisqe esointo isqi estw este estwn ontwn estwsan einai esesqai wn ousa on esomenos esomenh esomenon

Here are ten additional words that are sometimes useful.

epi ep
mh
oti
upo up
apo ap
meta met meq
ou ouk oux

polus pollou pollw| polun pollh pollhs pollh| pollhn polu pollou pollw| polu polloi pollwn pollois pollous pollai pollwn pollais pollas polla pollwn pollois polla

pas pantos panti panta pas pasa pashs pash| pasan pasa pan pantos panti pan pantes pantwn pasi pasin pantas pantes pasai paswn pasais pasas pasai panta pantwn

ode toude tw|de tonde oide twnde toisde tousde hde thsde th|de thnde aide twnde taisde tasde tode toude tode tade ekeinos ekeinou ekeinw| ekeinon ekeinoi ekeinwn ekeinois ekeinous ekeinh ekeinhs ekeinh| ekeinhn ekeinai ekeinais ekeinas ekeino ekeinou ekeinw| ekeino ekeina ekeinwn ekeinois ekeina

A good rule of thumb is that each 'word' should appear about 5 times or more, on average, in each 'Sample'-sized extract.
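A quick way to apply this rule of thumb is to scale each word's total count in a long extract down to the size of the 'Sample'. The Python sketch below does that; the function names and the counts in the example are invented for illustration and are not taken from any real text.

```python
def expected_count(total_hits, extract_length, sample_size):
    """Average number of hits per sample-sized stretch of the extract."""
    return total_hits * sample_size / extract_length

def sparse_words(hits_by_word, extract_length, sample_size, minimum=5.0):
    """Return the words that fall below the rule-of-thumb minimum."""
    return [word for word, hits in hits_by_word.items()
            if expected_count(hits, extract_length, sample_size) < minimum]

# Invented counts for a 50,000-word extract and a 2,000-word sample.
hits = {"kai": 2500, "te": 90, "men": 300}
print(sparse_words(hits, extract_length=50000, sample_size=2000))  # ['te']
```

Words flagged this way are candidates for removal or for merging with similar words on one line.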

There is no single list of words that work best for distinguishing the style of all authors from all other authors. Some adjustment is generally required to arrive at a list that works well for distinguishing the works of one author from the rest.

(2) The 'Sample'

The sample is what you want to compare to the various possible 'authors' to see if any of them are a likely match.

A sample between 1,000 words and 5,000 words in length is preferred. Samples as small as approximately 500 words may work occasionally.

All input must be in "Beta Code." You can copy Beta Code out of the TLG with Diogenes, for example. The program will automatically discard non-word characters such as quote marks, numbers, and other symbols. If you don't have the TLG with Diogenes, you can get it here:

https://kat.cr/tlg-phi-cd-rom-e-with-an ... uarium.2.0

Diogenes has a setting to change its results to "Beta Code." Be sure to use this setting in order to copy over text. You can also increase the number of lines that are returned by Diogenes. Other sources of "Beta Code" Greek are also, of course, acceptable.
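For anyone preparing text outside the form, the clean-up of non-word characters could be approximated as below. This is only a guess at the behavior, based on the word list above, which uses lowercase letters with `|` for iota subscript and no accents or breathings. The sketch therefore lowercases the text and drops everything except a-z and `|`, including the Beta Code marks for capitals (`*`) and diacritics. The function name `tokenize_beta_code` is hypothetical.

```python
import re

def tokenize_beta_code(text):
    """Lowercase Beta Code and keep only letters and '|' (iota subscript)."""
    tokens = []
    for raw in text.lower().split():
        word = re.sub(r"[^a-z|]", "", raw)  # strip accents, breathings, digits, punctuation
        if word:                            # skip items that were only numbers or symbols
            tokens.append(word)
    return tokens

print(tokenize_beta_code("*)en a)rxh=| h)=n o( lo/gos, 1.1"))
# ['en', 'arxh|', 'hn', 'o', 'logos']
```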

(3) The 'Authors'

These are the candidates being considered as possible authors.

The 'Sample' should not appear in any of the authors (remove the 'Sample' from the text of the author, if applicable).

If it's possible that the author isn't any one of these candidates, you'll need to use your own judgment in interpreting the results (see below).

The 'Author' sections should be as long as they can be, while not including any composite material (i.e., material from more than one author). At a minimum they should be about 4-5x as long as the 'Sample'. If necessary, you can break up the 'Sample' into parts (tested separately) in order to meet this requirement.

(4) The 'Controls'

These are extracts of ancient Greek from writers who are known not to be possible authors of the 'Sample'.

Again, the longer the better, while still avoiding composite texts.

These help you decide whether you believe that one of the candidates was the author or whether none of them was. (If one of the controls is a significantly closer match for the sample than the best candidate, that may mean that none of the candidates matches the sample significantly enough and that none of them can be declared the author.)

Interpreting the Results

When you've entered your data, press submit, and the results should appear in a new tab. This allows you to preserve your form data (at least for the length of the browser session). You would want to save your form data elsewhere if you want to work with it over multiple browser sessions.

(1) testsize, testsample, sample frequencies, and word list

This just repeats some of the data that you entered, along with a measurement of the sample length and the frequencies of the words in the 'Sample'.

(2) Author Stats and Control Stats

For each of the 'author'/'control' extracts, for each 'word', the mean and standard deviation for the appearance of the 'word' is calculated, based on samples made from the 'author'/'control' that are the exact same size as the 'Sample'.
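As a sketch of this step in Python (not the program's own code): the extract is cut into chunks of the same size as the 'Sample', the 'word' is counted in each chunk, and the mean and standard deviation of those counts are taken. Two things here are assumptions: that the chunks are consecutive and non-overlapping (any leftover at the end is dropped), and that the sample standard deviation is used. The names `chunk` and `word_stats` are hypothetical.

```python
from statistics import mean, stdev

def chunk(tokens, size):
    """Split tokens into consecutive, non-overlapping chunks of exactly `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens) - size + 1, size)]

def word_stats(tokens, group, size):
    """Mean and standard deviation of a word group's count per sample-sized chunk."""
    forms = set(group)
    counts = [sum(1 for t in c if t in forms) for c in chunk(tokens, size)]
    return mean(counts), stdev(counts)  # stdev needs at least two chunks

author = "kai a kai b c kai d e kai kai kai f".split()
m, s = word_stats(author, ["kai"], size=4)
print(m, s)  # chunk counts are 2, 1, 3
```

This is one reason the 'Author' extract needs to be several times longer than the 'Sample': with only one or two chunks, the standard deviation is not meaningful.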

(3) Author Q-Values and Control Q-Values

For each of the 'author'/'control' extracts, for each 'word', the actual number of appearances of the 'word' in the 'Sample' (the observed frequency) is compared to the mean and standard deviation for that 'author'/'control'. The result is a statistic called a Q-value (calculated from the Z-score), which estimates the likelihood that a number that far from the mean (or farther) would be drawn from the normal distribution defined by that mean and standard deviation. Values closer to 1 come from observed frequencies close to the mean, which are more 'likely', while values closer to 0 come from observed frequencies several standard deviations away from the mean, which are less 'likely.'
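In Python, a two-tailed value of this kind can be sketched as follows. The description ("that far from the mean, or farther", with values near 1 at the mean) suggests a two-tailed normal probability, which is what this computes; the function name `q_value` and the handling of a zero standard deviation are my own assumptions.

```python
import math

def q_value(observed, mean, sd):
    """Two-tailed probability of a value at least this far from the mean under N(mean, sd)."""
    if sd == 0:
        # Degenerate case (assumed handling): no spread was measured at all.
        return 1.0 if observed == mean else 0.0
    z = (observed - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))

print(q_value(10, 10, 2))  # 1.0 (exactly at the mean)
print(q_value(14, 10, 2))  # about 0.0455 (two standard deviations away)
```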

(4) Author Chi-Square-Based P-Values and Control Chi-Square-Based P-Values

For each 'author'/'control', the Q-values are combined using a method known as "Fisher's method" for combining several p-values, in order to arrive at a single p-value representing the likelihood that all the observed frequencies in the 'Sample' would be generated from that 'author'/'control' (according to the normal distribution, with the measured means and standard deviations).

This is not the only way to arrive at a combined p-value statistic. This method gives equal weighting to all of the individual components. This method is more sensitive to the effect of any one single extreme value on the result than the other one (below).
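For reference, Fisher's method computes the statistic -2 times the sum of ln(p) over the k individual p-values and compares it to a chi-square distribution with 2k degrees of freedom. The Python sketch below uses the closed form of the chi-square tail for even degrees of freedom, so it needs no outside libraries. It is a generic illustration, not the program's code; returning 0 when any input is 0 is an assumption about how that case is treated.

```python
import math

def chi2_sf_even(x, df):
    """Upper-tail probability of a chi-square variable with an even number of df."""
    k = df // 2
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i          # builds (x/2)^i / i!
        total += term
    return math.exp(-half) * total

def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method."""
    if any(p <= 0.0 for p in p_values):
        return 0.0                # ln(0) is undefined; treat as no match at all
    stat = -2.0 * sum(math.log(p) for p in p_values)
    return chi2_sf_even(stat, 2 * len(p_values))

print(fisher_combine([0.5, 0.5]))          # about 0.597
print(fisher_combine([0.9, 0.8, 1e-12]))   # one extreme value drags the result down
```

The second call shows the sensitivity mentioned above: a single very small Q-value dominates the combined result.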

(5) Bayesian Author Test: Posterior Probabilities from Equal Priors, Chi-Square-Based Method

This is a straight comparison of the different candidate authors, using equal prior likelihood for each. The most likely candidate, using the chi-square-based method, has the highest number.
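With equal priors, a posterior of this kind reduces to each candidate's p-value divided by the sum of all the candidates' p-values. The Python sketch below shows that calculation; the function name is hypothetical. The input figures are the seven Author Chi-Square-Based P-Values from the First Apology example later in this thread, and the first result agrees with the 0.9921 printed there.

```python
def equal_prior_posteriors(p_values):
    """Normalize p-values so they sum to 1 (Bayes' rule with equal priors)."""
    total = sum(p_values)
    if total == 0:
        return [0.0] * len(p_values)  # nothing matched at all
    return [p / total for p in p_values]

author_ps = [0.888460855804236, 0, 0, 0.00631746893963075,
             1.01758673269655e-08, 3.82515355529966e-10, 0.000730455451654238]
posteriors = equal_prior_posteriors(author_ps)
print(round(posteriors[0], 4))  # 0.9921
```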

(6) Bayesian Comparison of Best Author to Best Control: from Equal Priors, Chi-Square-Based Method

This is a comparison of the best 'author' candidate to the best 'control' candidate. "$VAR1" indicates which 'author' actually is the best candidate and "$VAR2" gives its value, while "$VAR3" indicates which 'control' actually is the best candidate and "$VAR4" gives its value. The one that is a closer match will have a value greater than 0.5.

Initial testing, using a large number of controls, shows that "$VAR2" values greater than 0.51 are typical for the author, while "$VAR2" values less than 0.3 are typical if the candidate being considered is not likely to be a reliable result. (Values between 0.3 and 0.51 do not appear to be probative; further analysis is recommended to confirm or disconfirm the result.)
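The comparison and the rough guidance above can be sketched in Python like this. It is an illustration under assumptions, not the program's code: the two-way value is taken to be the best author's p-value divided by the sum of the best author's and best control's p-values, which agrees with the figures in the First Apology example later in this thread. The names `best_vs_best` and `verdict` are hypothetical.

```python
def best_vs_best(author_ps, control_ps):
    """Return (best author number, its share, best control number, its share)."""
    a = max(range(len(author_ps)), key=author_ps.__getitem__)
    c = max(range(len(control_ps)), key=control_ps.__getitem__)
    total = author_ps[a] + control_ps[c]
    share = author_ps[a] / total if total > 0 else 0.5
    return a + 1, share, c + 1, 1.0 - share   # numbering starts at 1, like $VAR1

def verdict(share):
    """Apply the rough thresholds from initial testing."""
    if share > 0.51:
        return "typical for the author"
    if share < 0.3:
        return "probably not reliable"
    return "not probative"

authors = [0.888460855804236, 0, 0, 0.00631746893963075]
controls = [6.50657222634683e-49, 0, 0, 0, 0, 0, 0, 0.0217532829559411, 0]
result = best_vs_best(authors, controls)
print(result[0], round(result[1], 4), result[2], verdict(result[1]))
# 1 0.9761 8 typical for the author
```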

(7) Posterior Probability of a Sample Meeting the Test Being by the Best Author Candidate (with Prior = 0.5), not Any Other, Chi-Square-Based Method
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Second-Best Author, Chi-Square-Based Method
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Best Control Author, Chi-Square-Based Method

This scores every 'Sample'-sized sample in the best candidate author and every 'Sample'-sized sample outside the best candidate author, then uses Bayesian analysis to determine how likely a sample is to be from the best candidate author (and not outside that author generally, in the second best author candidate instead, or in the best control candidate instead) if it is as closely similar to the best candidate author as the 'Sample' is.

Initial testing indicates that, if the first value is less than 0.9, or if either of the other two is less than 0.7, the candidate being considered may not be a reliable result.
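The Bayesian step here can be sketched in Python as follows. The sketch assumes the calculation is Bayes' rule applied to two rates: the fraction of sample-sized chunks inside the best candidate that score at least as well as the 'Sample', and the same fraction for chunks outside it. That assumption agrees with the Second Apology figures later in this thread (0.9333 inside and 0.0792 outside give 0.9218). The function names are hypothetical.

```python
def posterior_given_pass(rate_inside, rate_outside, prior=0.5):
    """P(by the best author | sample passes the test), by Bayes' rule."""
    numerator = prior * rate_inside
    denominator = numerator + (1.0 - prior) * rate_outside
    if denominator == 0:
        return None  # no chunk anywhere passes the test; nothing can be said
    return numerator / denominator

def may_be_unreliable(vs_any_other, vs_second_best, vs_best_control):
    """Rule of thumb: first value under 0.9, or either other value under 0.7."""
    return vs_any_other < 0.9 or vs_second_best < 0.7 or vs_best_control < 0.7

p_any = posterior_given_pass(0.933333333333333, 0.0792079207920792)
p_second = posterior_given_pass(0.933333333333333, 0.0)
p_control = posterior_given_pass(0.933333333333333, 0.2)
print(round(p_any, 4), p_second, round(p_control, 4))  # 0.9218 1.0 0.8235
print(may_be_unreliable(p_any, p_second, p_control))   # False
```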

(8) Author Z-Score-Based P-Values and Control Z-Score-Based P-Values

These results are similar to (4) above, which is based on Fisher's chi-square method for combining p-values.

Here, the absolute values of the Z-scores are taken. A weighted average is computed. And then the resulting combined "Z-score" is used to calculate a combined "p-value."

The weight assigned to each Z-score is equal to the square root of the mean number of appearances of the given 'word'. (If this mean value is less than 4, then a weight of 0 is assigned, and it is thereby removed from consideration.)

This method is less sensitive to extreme individual Q-values (since it takes an average of Z-scores instead of working with the Q-values directly) and is less sensitive to the effect of 'words' with fewer observations (due to weighting).

It is also insensitive to the effect of 'words' with too few appearances on average for the observed frequency to be significant (thus, it will perform more robustly when the practitioner fails to remove such uncommon words on their own).
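A Python sketch of the weighting scheme described in this section follows. The weights (square root of the mean count, or 0 when the mean is below 4) come from the description above. The last step is an assumption: the text does not say how the combined "Z-score" is turned into a "p-value", so the sketch uses a two-tailed normal probability. The name `combined_z_p` is hypothetical.

```python
import math

def combined_z_p(z_scores, mean_counts, min_mean=4.0):
    """Weighted average of |Z| and an (assumed two-tailed) p-value for it."""
    weights = [math.sqrt(m) if m >= min_mean else 0.0 for m in mean_counts]
    total_weight = sum(weights)
    if total_weight == 0:
        return None  # every word was too rare to be counted
    z = sum(w * abs(s) for w, s in zip(weights, z_scores)) / total_weight
    return z, math.erfc(z / math.sqrt(2))

# Three words: the third has a mean count of 1, so its extreme Z-score is ignored.
z, p = combined_z_p([1.0, -2.0, 5.0], [16.0, 4.0, 1.0])
print(round(z, 4), round(p, 2))  # 1.3333 0.18
```

In the example, the rare word with a Z-score of 5 has no effect on the result, which is the robustness property described above.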

(9) Bayesian Author Test: Posterior Probabilities from Equal Priors, Z-Score-Based Method

Like (5) above, this is a straight comparison of the candidate authors.

(10) Bayesian Comparison of Best Author to Best Control: from Equal Priors, Z-Score-Based Method

Like (6) above, this is a comparison of the best candidate author against the best candidate control.

(11) Posterior Probability of a Sample Meeting the Test Being by the Best Author Candidate (with Prior = 0.5), not Any Other, Z-Score-Based Method
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Second-Best Author, Z-Score-Based Method
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Best Control Author, Z-Score-Based Method

Like (7) above.

Some more notes:

Sometimes a result is deemed 'unreliable' with one method but appears to be 'reliable' with another. This is acceptable. It should be interpreted as 'reliable.'

Sometimes a result is deemed 'unreliable' with one method, and a different result appears to be 'reliable' with another. This is acceptable. The latter should be interpreted as 'reliable.'

Sometimes two different 'author' candidate results might be deemed 'reliable,' each according to a different method. Favor the Z-score-based method.

(I will show a worked example in another post....)

Post by **Ben C. Smith** » Mon May 25, 2015 7:33 pm

You just never seem to fail to impress, Peter Kirby.

(Hey, guys, remember that one time, several years ago, when Peter Kirby failed to impress? .... Yeah, thought so. Nor do I.)

Ben.

Post by **Peter Kirby** » Mon May 25, 2015 7:52 pm

LOL. Thanks!

Post by **Peter Kirby** » Mon May 25, 2015 8:48 pm

Here's a little study of whether the First Apology (attributed to Justin) and the Second Apology (attributed to Justin) were written by the author of the Dialogue with Trypho (attributed to Justin). It is intended as an example of the sort of thing you can do with the program and also as a sort of sanity check for those who are a bit skeptical (and who believe that the result should indicate that the same person wrote all three).

Author Group
#1 Justin (Dialogue with Trypho)
#2 Tatian
#3 Athenagoras
#4 Irenaeus
#5 Clement of Alexandria
#6 Origen
#7 Eusebius

Control Group
#1 Josephus
#2 Acts
#3 Mark
#4 John
#5 1Cor
#6 Hebrews
#7 Revelation
#8 Life of Adam and Eve
#9 1 Maccabees
#10 2 Maccabees
#11 Polybius
#12 Diodorus Siculus
#13 Dionysius Halicarnassus
#14 Strabo
#15 Plutarch
#16 Arrian
#17 Herodian
#18 Herodotus
#19 Thucydides
#20 Xenophon
#21 Epictetus
#22 Galen
#23 Lucian
#24 Philostratus
#25 Basil
#26 John Chrysostom

To test the First Apology, we take a 4641-word sample consisting of chapters 50 through 68.

testsize: 4641

Author Chi-Square-Based P-Values
$VAR1 = '0.888460855804236'; $VAR2 = 0; $VAR3 = 0; $VAR4 = '0.00631746893963075'; $VAR5 = '1.01758673269655e-08'; $VAR6 = '3.82515355529966e-10'; $VAR7 = '0.000730455451654238';

Control Chi-Square-Based P-Values
$VAR1 = '6.50657222634683e-49'; $VAR2 = 0; $VAR3 = 0; $VAR4 = 0; $VAR5 = 0; $VAR6 = 0; $VAR7 = 0; $VAR8 = '0.0217532829559411'; $VAR9 = 0; $VAR10 = 0; $VAR11 = 0; $VAR12 = 0; $VAR13 = 0; $VAR14 = '8.46752788675696e-21'; $VAR15 = 0; $VAR16 = 0; $VAR17 = 0; $VAR18 = 0; $VAR19 = '2.2846420327196e-32'; $VAR20 = 0; $VAR21 = 0; $VAR22 = '2.1288841096972e-40'; $VAR23 = '6.11732606067973e-33'; $VAR24 = 0; $VAR25 = '1.63086849061174e-19'; $VAR26 = '2.44254151170694e-18';

Bayesian Author Test: Posterior Probabilities from Equal Priors, Chi-Square-Based Method
$VAR1 = '0.992129686472721'; $VAR2 = '0'; $VAR3 = '0'; $VAR4 = '0.00705461409743644'; $VAR5 = '1.13632243837593e-08'; $VAR6 = '4.27148632687276e-10'; $VAR7 = '0.0008156876394695';

Bayesian Comparison of Best Author to Best Control: from Equal Priors, Chi-Square-Based Method
$VAR1 = 1; $VAR2 = '0.976100917323069'; $VAR3 = 8; $VAR4 = '0.0238990826769311';

Percentage of Samples in the Best Author Candidate that Meet the P-Value>0.7 Test, Chi-Square-Based Method
0.181818181818182
Percentage of Samples outside the Best Author Candidate that Meet the P-Value>0.7 Test, Chi-Square-Based Method
0
Posterior Probability of a Sample Meeting the Test Being by the Best Author Candidate (with Prior = 0.5), not Any Other, Chi-Square-Based Method
1

Percentage of Samples in the Second-Best Author Candidate that Meet the P-Value>0.7 Test, Chi-Square-Based Method
0
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Second-Best Author, Chi-Square-Based Method
1

Percentage of Samples in the Best Control Candidate that Meet the P-Value>0.7 Test, Chi-Square-Based Method
0
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Best Control Author, Chi-Square-Based Method
1

Author Z-Score-Based P-Values
$VAR1 = '0.211556501767805'; $VAR2 = '9.87894456773742e-42'; $VAR3 = '2.11748551596198e-09'; $VAR4 = '0.00560915487879656'; $VAR5 = '0.0354182939677903'; $VAR6 = '0.0907131978115263'; $VAR7 = '0.0926953240598759';

Control Z-Score-Based P-Values
$VAR1 = '0.00158311378490499'; $VAR2 = '0.000395314434282135'; $VAR3 = '1.20917740013948e-14'; $VAR4 = '2.64020185029806e-12'; $VAR5 = '0'; $VAR6 = '3.11298773021434e-224'; $VAR7 = '2.86618459242993e-51'; $VAR8 = '0.119270040623747'; $VAR9 = '4.31835159940076e-13'; $VAR10 = '3.78634400157456e-13'; $VAR11 = '3.26965236089271e-05'; $VAR12 = '8.28894037464006e-05'; $VAR13 = '2.83748794406989e-05'; $VAR14 = '0.00699144781369404'; $VAR15 = '8.7302454204118e-16'; $VAR16 = '5.45242882713467e-05'; $VAR17 = '1.69376834935096e-09'; $VAR18 = '2.89547666909129e-07'; $VAR19 = '0.000276715507818051'; $VAR20 = '0.00215812982785909'; $VAR21 = '8.09173261441971e-14'; $VAR22 = '0.000342826130120986'; $VAR23 = '0.0126968173134049'; $VAR24 = '1.1126934485835e-07'; $VAR25 = '0.0142496971279844'; $VAR26 = '0.00448348611287948';

Bayesian Author Test: Posterior Probabilities from Equal Priors, Z-Score-Based Method
$VAR1 = '0.48522970943548'; $VAR2 = '2.26585208304949e-41'; $VAR3 = '4.85670198296137e-09'; $VAR4 = '0.0128652561810854'; $VAR5 = '0.0812360213327496'; $VAR6 = '0.208061384302719'; $VAR7 = '0.212607623891265';

Bayesian Comparison of Best Author to Best Control: from Equal Priors, Z-Score-Based Method
$VAR1 = 1; $VAR2 = '0.639478622961926'; $VAR3 = 8; $VAR4 = '0.360521377038074';

Percentage of Samples in the Best Author Candidate that Meet the P-Value>0.21 Test, Z-Score-Based Method
0.727272727272727
Percentage of Samples outside the Best Author Candidate that Meet the P-Value>0.21 Test, Z-Score-Based Method
0
Posterior Probability of a Sample Meeting the Test Being by the Best Author Candidate (with Prior = 0.5), not Any Other, Z-Score-Based Method
1

Percentage of Samples in the Second-Best Author Candidate that Meet the P-Value>0.21 Test, Z-Score-Based Method
0
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Second-Best Author, Z-Score-Based Method
1

Percentage of Samples in the Best Control Candidate that Meet the P-Value>0.21 Test, Z-Score-Based Method
0
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Best Control Author, Z-Score-Based Method
1

The results indicate that the author is the same as the author of the Dialogue with Trypho.

To test the Second Apology, we put the entire 3295-word text in.

Author Chi-Square-Based P-Values
$VAR1 = '1.94493602477902e-05'; $VAR2 = '6.87563449738722e-49'; $VAR3 = 0; $VAR4 = 0; $VAR5 = '3.74277169629409e-28'; $VAR6 = 0; $VAR7 = '5.86049482486032e-12';

Control Chi-Square-Based P-Values
$VAR1 = '3.45057626008072e-39'; $VAR2 = 0; $VAR3 = 0; $VAR4 = 0; $VAR5 = 0; $VAR6 = 0; $VAR7 = 0; $VAR8 = '7.54061444420771e-11'; $VAR9 = 0; $VAR10 = 0; $VAR11 = 0; $VAR12 = '1.93744868969708e-42'; $VAR13 = 0; $VAR14 = '7.58407562510822e-15'; $VAR15 = 0; $VAR16 = 0; $VAR17 = 0; $VAR18 = 0; $VAR19 = 0; $VAR20 = 0; $VAR21 = '1.57529752644245e-43'; $VAR22 = '3.84962930383549e-17'; $VAR23 = '1.64623005556979e-23'; $VAR24 = 0; $VAR25 = '3.95474661623243e-26'; $VAR26 = '4.87924453750406e-24';

Bayesian Author Test: Posterior Probabilities from Equal Priors, Chi-Square-Based Method
$VAR1 = '0.999999698679392'; $VAR2 = '3.53514580326519e-44'; $VAR3 = '0'; $VAR4 = '0'; $VAR5 = '1.92436693075552e-23'; $VAR6 = '0'; $VAR7 = '3.0132060820038e-07';

Bayesian Comparison of Best Author to Best Control: from Equal Priors, Chi-Square-Based Method
$VAR1 = 1; $VAR2 = '0.999996122964913'; $VAR3 = 8; $VAR4 = '3.87703508645621e-06';

Percentage of Samples in the Best Author Candidate that Meet the P-Value>1.9e-05 Test, Chi-Square-Based Method
0.933333333333333
Percentage of Samples outside the Best Author Candidate that Meet the P-Value>1.9e-05 Test, Chi-Square-Based Method
0.0792079207920792
Posterior Probability of a Sample Meeting the Test Being by the Best Author Candidate (with Prior = 0.5), not Any Other, Chi-Square-Based Method
0.921773142112125

Percentage of Samples in the Second-Best Author Candidate that Meet the P-Value>1.9e-05 Test, Chi-Square-Based Method
0
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Second-Best Author, Chi-Square-Based Method
1

Percentage of Samples in the Best Control Candidate that Meet the P-Value>1.9e-05 Test, Chi-Square-Based Method
0.2
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Best Control Author, Chi-Square-Based Method
0.823529411764706

Author Z-Score-Based P-Values
$VAR1 = '0.0806523205258673'; $VAR2 = '0.00144687349965014'; $VAR3 = '0.00230037565992109'; $VAR4 = '0.00170887928048379'; $VAR5 = '0.0410372784494046'; $VAR6 = '0.0354496933328968'; $VAR7 = '0.0571990842645113';

Control Z-Score-Based P-Values
$VAR1 = '0.00462825929965623'; $VAR2 = '0.000103340653796794'; $VAR3 = '9.27852482839872e-06'; $VAR4 = '1.1369044928401e-08'; $VAR5 = '9.5151095815589e-14'; $VAR6 = '2.07222888758312e-234'; $VAR7 = '1.93349986706008e-78'; $VAR8 = '0.0606829808842978'; $VAR9 = '8.26439242154331e-13'; $VAR10 = '4.18605420315407e-07'; $VAR11 = '2.35700063856191e-05'; $VAR12 = '0.000182856448558814'; $VAR13 = '0.00171504960896788'; $VAR14 = '0.0279026368047166'; $VAR15 = '1.56316735712744e-09'; $VAR16 = '0.000240126094636099'; $VAR17 = '1.79762719439227e-05'; $VAR18 = '4.06081317403127e-05'; $VAR19 = '0.00345519144246145'; $VAR20 = '0.00237669054742433'; $VAR21 = '0.00666006758748458'; $VAR22 = '0.0153212403407854'; $VAR23 = '0.0491781693169549'; $VAR24 = '0.000328289585005977'; $VAR25 = '0.0300989557781122'; $VAR26 = '0.0246370106233428';

Bayesian Author Test: Posterior Probabilities from Equal Priors, Z-Score-Based Method
$VAR1 = '0.36694420782355'; $VAR2 = '0.00658284655281216'; $VAR3 = '0.0104660289836991'; $VAR4 = '0.00777489537504488'; $VAR5 = '0.186707481367775'; $VAR6 = '0.161285621452833'; $VAR7 = '0.260238918444286';

Bayesian Comparison of Best Author to Best Control: from Equal Priors, Z-Score-Based Method
$VAR1 = 1; $VAR2 = '0.570645264991572'; $VAR3 = 8; $VAR4 = '0.429354735008428';

Percentage of Samples in the Best Author Candidate that Meet the P-Value>0.08 Test, Z-Score-Based Method
0.933333333333333
Percentage of Samples outside the Best Author Candidate that Meet the P-Value>0.08 Test, Z-Score-Based Method
0.0396039603960396
Posterior Probability of a Sample Meeting the Test Being by the Best Author Candidate (with Prior = 0.5), not Any Other, Z-Score-Based Method
0.959294436906377

Percentage of Samples in the Second-Best Author Candidate that Meet the P-Value>0.08 Test, Z-Score-Based Method
0
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Second-Best Author, Z-Score-Based Method
1

Percentage of Samples in the Best Control Candidate that Meet the P-Value>0.08 Test, Z-Score-Based Method
0.2
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Best Control Author, Z-Score-Based Method
0.823529411764706

Last but not least, let's identify an unreliable result (based on an excerpt of Romans 3-4).

testsize: 829

Author Chi-Square-Based P-Values
$VAR1 = '1.64139255508376e-05'; $VAR2 = '8.94264926804366e-24'; $VAR3 = '7.14474143179853e-17'; $VAR4 = '1.00436073947967e-19'; $VAR5 = '1.03907335376701e-08'; $VAR6 = '2.50294178139516e-08'; $VAR7 = '1.9828950617084e-13';

Control Chi-Square-Based P-Values
$VAR1 = '3.85812800821849e-36'; $VAR2 = '4.70936019769718e-28'; $VAR3 = '2.35342420462997e-39'; $VAR4 = '4.81452529568474e-37'; $VAR5 = '3.98988031780169e-09'; $VAR6 = 0; $VAR7 = 0; $VAR8 = '2.95840507860174e-17'; $VAR9 = 0; $VAR10 = '1.25070413738294e-51'; $VAR11 = '5.05211938633974e-43'; $VAR12 = '4.59710989342204e-34'; $VAR13 = '1.5275742423603e-40'; $VAR14 = '3.90120214184003e-15'; $VAR15 = 0; $VAR16 = 0; $VAR17 = 0; $VAR18 = 0; $VAR19 = '4.23818579524785e-28'; $VAR20 = '3.37160534646114e-33'; $VAR21 = '4.50902205262221e-17'; $VAR22 = '9.7817794756768e-12'; $VAR23 = '7.35790723030463e-32'; $VAR24 = 0; $VAR25 = '1.88975491293184e-05'; $VAR26 = '7.27129411865981e-16';

Bayesian Author Test: Posterior Probabilities from Equal Priors, Chi-Square-Based Method
$VAR1 = '0.997846701630155'; $VAR2 = '5.43647712322986e-19'; $VAR3 = '4.34348057059187e-12'; $VAR4 = '6.1057791936035e-15'; $VAR5 = '0.000631680651649651'; $VAR6 = '0.00152160565929337'; $VAR7 = '1.20545526472397e-08';

Bayesian Comparison of Best Author to Best Control: from Equal Priors, Chi-Square-Based Method
$VAR1 = 1; $VAR2 = '0.464832627340306'; $VAR3 = 25; $VAR4 = '0.535167372659694';

Percentage of Samples in the Best Author Candidate that Meet the P-Value>1.6e-05 Test, Chi-Square-Based Method
0.983870967741935
Percentage of Samples outside the Best Author Candidate that Meet the P-Value>1.6e-05 Test, Chi-Square-Based Method
0.366906474820144
Posterior Probability of a Sample Meeting the Test Being by the Best Author Candidate (with Prior = 0.5), not Any Other, Chi-Square-Based Method
0.728373851043725

Percentage of Samples in the Second-Best Author Candidate that Meet the P-Value>1.6e-05 Test, Chi-Square-Based Method
0.807692307692308
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Second-Best Author, Chi-Square-Based Method
0.549168975069252

Percentage of Samples in the Best Control Candidate that Meet the P-Value>1.6e-05 Test, Chi-Square-Based Method
0.804878048780488
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Best Control Author, Chi-Square-Based Method
0.550032988783813

Author Z-Score-Based P-Values
$VAR1 = '0.092944936302373'; $VAR2 = '0.0179815177344596'; $VAR3 = '0.0302195183096689'; $VAR4 = '0.0248526327223616'; $VAR5 = '0.0559421645002796'; $VAR6 = '0.0879517542012107'; $VAR7 = '0.0421041182426673';

Control Z-Score-Based P-Values
$VAR1 = '0.00519145846504496'; $VAR2 = '0.0215540129534138'; $VAR3 = '0.0187010097507574'; $VAR4 = '0.0146081967706652'; $VAR5 = '0.0917803912120555'; $VAR6 = '0.00293637254008674'; $VAR7 = '0.000195125886212675'; $VAR8 = '0.0470097764320487'; $VAR9 = '0.000884154117507113'; $VAR10 = '0.00293141385939037'; $VAR11 = '0.00332657651863712'; $VAR12 = '0.00425320970557994'; $VAR13 = '0.00646680016129299'; $VAR14 = '0.0260502374964299'; $VAR15 = '0.000481371517759226'; $VAR16 = '0.001802460355443'; $VAR17 = '6.4985196428693e-06'; $VAR18 = '3.34298816939969e-05'; $VAR19 = '0.000273259415753231'; $VAR20 = '0.0134573437681482'; $VAR21 = '0.0354389905667045'; $VAR22 = '0.0406696129722549'; $VAR23 = '0.0138978627036434'; $VAR24 = '2.37747537918995e-05'; $VAR25 = '0.0784283645521996'; $VAR26 = '0.0242931502755357';

Bayesian Author Test: Posterior Probabilities from Equal Priors, Z-Score-Based Method
$VAR1 = '0.264050633468642'; $VAR2 = '0.0510843445313165'; $VAR3 = '0.0858517232915847'; $VAR4 = '0.0706047437845793'; $VAR5 = '0.15892811982624'; $VAR6 = '0.24986532172076'; $VAR7 = '0.119615113376876';

Bayesian Comparison of Best Author to Best Control: from Equal Priors, Z-Score-Based Method
$VAR1 = 1; $VAR2 = '0.503152099135476'; $VAR3 = 5; $VAR4 = '0.496847900864524';

Percentage of Samples in the Best Author Candidate that Meet the P-Value>0.09 Test, Z-Score-Based Method
0.983870967741935
Percentage of Samples outside the Best Author Candidate that Meet the P-Value>0.09 Test, Z-Score-Based Method
0.405275779376499
Posterior Probability of a Sample Meeting the Test Being by the Best Author Candidate (with Prior = 0.5), not Any Other, Z-Score-Based Method
0.708255603508284

Percentage of Samples in the Second-Best Author Candidate that Meet the P-Value>0.09 Test, Z-Score-Based Method
0.846153846153846
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Second-Best Author, Z-Score-Based Method
0.537627118644068

Percentage of Samples in the Best Control Candidate that Meet the P-Value>0.09 Test, Z-Score-Based Method
0.5
Posterior Probability of a Sample Meeting the Test Being by the Best Author, not the Best Control Author, Z-Score-Based Method
0.66304347826087

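The values that fall short of the guidance given in the documentation above (the best-author-versus-best-control values of about 0.46 and 0.50, and the posterior probabilities of roughly 0.54 to 0.73) are all indications that the result is not reliable and can be discarded.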

Post by **Peter Kirby** » Tue May 26, 2015 3:16 pm

Updates:

(a) A worked example, based on the works of Justin Martyr, has been posted above.

(b) Some of the specific guidance (in the OP) for identifying a 'reliable' or 'unreliable' result has been adjusted, based on experience.

(c) A list of ten additional words that may often be useful has been added to the OP.

(d) The order of the results being printed by the program has been reorganized, so that less-important information is appended at the end.

(e) A bug has been fixed.

Post by **Peter Kirby** » Tue May 26, 2015 3:35 pm

Methodologically you will encounter a few problems most frequently:

(1) False Positive, and the Actual Author Is Listed among the Candidates

Check that everything is set up sufficiently well: make sure that the 'Sample' is generous (>2000 words preferably) and that the list of 'words' includes enough features to discriminate between authors effectively.

This can happen when there is a "dead ringer" or "look-alike" for the actual author among the 'authors'. What I mean by this is that there may be another extracted candidate-author that has average/mean values for each of the 'words' being considered that are very close to the average/mean values for the actual author.

This can happen, and it is just a natural consequence of the type of measurements being made. Stylometry is less like genetic sequencing (comparing whether two samples are a DNA match) and more like biometrics (comparing whether a person matches the height, weight, sex, build, and description of the suspect). Even with a modestly-adequate list of biometric or stylometric measurements, there will typically be more than one person in the population that has these measurements.

The crude solution to this is to remove your "dead ringer" from the list of 'authors', if it is not really a genuine candidate. (It is not recommended to remove it from the list of candidate authors, if it is genuinely a possible candidate author.) This will mean that the results are not as robust as they might be otherwise, but it could keep this "look-alike" from interfering with the task of determining whether the sample is more like the (believed) actual author than like the other candidates.

A better solution is to adjust (add to, subtract from, or modify) the list of words and/or to take a larger 'Sample' (if possible). This may provide the data that is needed to tell the two "look alikes" apart (for example, perhaps one of them has the stylometric equivalent of a "mole on the left elbow" but the other doesn't--adding another preposition or two to consideration may be just what is needed).

(Words may need to be subtracted, not just added, if they are not, in fact, invariant regardless of subject, for this particular author.)

(2) False Positive, and the Actual Author Is Not Listed among the Candidates

One way to fight this problem is to include more controls. The more controls there are, the more likely it is that one of them will present a closer match to the sample than the supposed candidate author, and the more likely it is that you will be able to identify the false positive as not likely being one of the candidates. (Obviously, this is in direct tension with the 'crude' solution mentioned under the other headings, which is why the crude solution is not recommended if there is a good chance that the actual author is not one of the candidates.)

Another way to fight this problem is to make sure that your list of words consists of those which appear sufficiently often (thus avoiding estimates that are based on faulty indicators, something the chi-square-based method is susceptible to). Along with this, make sure that the sample is not overly large (less than 5000 words max). After checking these things, make sure that the Z-score-based P-value for the candidate author is generously large (greater than 0.05-0.10 seems quite common for authors). This is an absolute reference point, instead of a relative reference point, for the similarity of the 'Sample' to the author. If the candidate author being proposed is simply the best of a bad bunch of matches, the P-values may be low in absolute terms.

(3) False Negative (and the Actual Author Is Listed among the Candidates)

Check that everything is set up sufficiently well: make sure that the 'Sample' is generous (preferably >2000 words) and that the list of 'words' includes enough features to discriminate between authors effectively.

This can happen when there is a "dead ringer" or "look-alike" for the actual author among the 'controls'. What I mean by this is that there may be another extracted candidate-author/control that has average/mean values for each of the 'words' being considered that are very close to the average/mean values for the actual author.
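One way to spot such a "look-alike" before running a comparison is to check the mean values directly. The sketch below is only an illustration under assumptions: it takes each candidate's mean rate per word as a plain dictionary, and the name `find_look_alikes` and the `tolerance` parameter are hypothetical, not part of the program.

```python
def find_look_alikes(profiles, tolerance):
    """Return pairs of candidates whose mean rates are close on every word.

    profiles maps a candidate name to a dict of word -> mean rate.
    Two candidates count as look-alikes when, for every word they share,
    their means differ by no more than `tolerance` (a hypothetical
    threshold chosen by the user).
    """
    names = sorted(profiles)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = profiles[first].keys() & profiles[second].keys()
            if shared and all(
                abs(profiles[first][w] - profiles[second][w]) <= tolerance
                for w in shared
            ):
                pairs.append((first, second))
    return pairs


if __name__ == "__main__":
    # Made-up rates per 1000 words, purely for illustration.
    example = {
        "A": {"kai": 50.0, "de": 20.0},
        "B": {"kai": 51.0, "de": 19.5},
        "C": {"kai": 30.0, "de": 35.0},
    }
    print(find_look_alikes(example, tolerance=2.0))
```

Any pair this reports is a candidate for the trouble described here, and a hint that the word list needs another feature or two to separate them.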

This can happen, and it is just a natural consequence of the type of measurements being made. Stylometry is less like genetic sequencing (comparing whether two samples are a DNA match) and more like biometrics (comparing whether a person matches the height, weight, sex, build, and description of the suspect). Even with a modestly adequate list of biometric or stylometric measurements, there will typically be more than one person in the population who has these measurements.

The crude solution to this is to remove your "dead ringer" from the list of 'controls'. This will mean that the results are not as robust as they might be otherwise (especially if the author might not be any of the candidate authors, which is what the controls are there to gauge), but it could keep this "look-alike" from interfering with the task of determining whether the sample is more like the (believed) actual author than like the other candidates.

A better solution is to adjust (add to, subtract from, or modify) the list of words and/or to take a larger 'Sample' (if possible). This may provide the data needed to tell the two "look-alikes" apart. For example, perhaps one of them has the stylometric equivalent of a "mole on the left elbow" but the other doesn't; adding another preposition or two to consideration may be just what is needed.

(Words may need to be subtracted, not just added, if they are not, in fact, invariant regardless of subject, for this particular author.)

Post by **Peter Kirby** » Tue May 26, 2015 3:44 pm

Declension of Greek nouns and conjugation of Greek verbs can be conveniently accessed here:

http://en.wiktionary.org/

Data for Greek word frequency can be accessed here:

http://perseus.uchicago.edu/GreekFrequency.html

With this information you can find other words that you might want to consider as "function words" (a term in stylometry for words that appear frequently, are not a matter of the subject being discussed, and can be considered a matter of style in their use) to be added to your list.
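To see whether a candidate "function word" appears often enough in a given text to be worth adding, you can count it yourself. This is a minimal sketch, not the program's code: it assumes the text is already transliterated, lower-cased or lower-caseable, stripped of punctuation, and separated by whitespace, and that each group is a space-separated line of forms as in the word list at the top of the thread. The function name `group_rates` is hypothetical.

```python
def group_rates(text, groups, per=1000):
    """Rate of each word group per `per` words of text.

    Each entry in `groups` is one line of space-separated forms; a token
    counts toward the group if it equals any of the forms.
    """
    tokens = text.lower().split()
    total = len(tokens)
    if total == 0:
        raise ValueError("text contains no words")
    rates = {}
    for group in groups:
        forms = set(group.split())
        count = sum(1 for token in tokens if token in forms)
        rates[group] = per * count / total
    return rates


if __name__ == "__main__":
    sample = "kai o logos hn pros ton qeon kai qeos hn o logos"
    for group, rate in group_rates(sample, ["kai", "o oi h ai to ta", "de d"]).items():
        print(group, round(rate, 1))
```

Groups that come out with very low rates on your texts are the ones most likely to give unreliable results, so they are candidates for merging with similar words or dropping.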

Post by **Peter Kirby** » Tue May 26, 2015 4:10 pm

Here are some sample files that you can work with.

Post by **Peter Kirby** » Wed Jun 03, 2015 7:28 pm

The chi-square-based method (Fisher's method) for aggregating p-values is, in this context, too unreliable. It has been 'commented out' of the source code, so it is no longer calculated or displayed.
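For anyone curious what was removed, here is a minimal Python sketch of Fisher's method in its textbook form. It is not the program's own code, and the function name `fisher_combined_p` is hypothetical. The statistic is -2 times the sum of the natural logs of the k p-values, referred to a chi-square distribution with 2k degrees of freedom; because the degrees of freedom are always even, the tail probability has a closed form that needs only the standard library.

```python
import math


def fisher_combined_p(p_values):
    """Combine p-values with Fisher's method.

    Returns (statistic, combined_p), where statistic = -2 * sum(ln p)
    and combined_p is the upper tail of a chi-square distribution with
    2k degrees of freedom (k = number of p-values).
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(p <= 0 or p > 1 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0
    # Closed form for even degrees of freedom:
    # P(X >= x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return statistic, math.exp(-half) * total


if __name__ == "__main__":
    print(fisher_combined_p([0.5, 0.5, 0.5, 0.5]))
    # One extreme value pulls the combined result down sharply.
    print(fisher_combined_p([0.5, 0.5, 0.5, 1e-8]))
```

The second call shows one property of the method that matters here: a single very small p-value, such as one produced by a word that occurs too rarely to estimate well, can dominate the combined result.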

Post by **Aleph One** » Sat Jun 06, 2015 2:05 am

Wow! Really impressive work here (and it's obvious that lots of it went into this)! I applaud this for the effort to add some objective criteria into the analysis of ancient texts. At least with these kinds of calculations, it's clear for all to see how the final results were arrived at. At least in an argument over these types of results it's clear what we're arguing about.

It is intimidating to someone who doesn't know ancient Greek, though.

I can hardly imagine how much work was put into the calculator, and this thread explaining it, but there are some (pretty obvious, I'm sure you've thought of them) things that would make life easier for the user, and the tool much more powerful, that stand out:

-Check boxes/drop-down lists for the common word features, and a database of Greek documents that could be selected at will for the analysis by their English title and chapter contents. (I know, this is no minor request.)

-A way for the program to analyze any sample text and automatically come up with a list of its common words and their statistics, without having to specify beforehand which word features you're interested in, maybe through a built-in database of Greek words/features. This way you could get an idea of which word features are good choices to use with your sample. (The list of 20 important word features you provide is aimed at helping this issue, I think.)

-More of the same (and more tall orders): An ability to save your state between sessions, along with default/preset set-ups, would save frequent users a lot of time and energy copy/pasting. Maybe most importantly of all, this would make it more feasible to share one's findings with others, and allow them to study what exactly you did and how you did it.

I'm sorry if this comes off as critical at all; I'm very excited by this kind of tool and it's hard for me not to let my mind run wild and imagine what might be possible. Obviously if you could get a database of sources connected to it and a way to save states, adding in more/different algorithms of analysis might not seem that hard, and could really be the start of something amazing.

Jeff

Biblical Criticism & History Forum - earlywritings.com
