User:Matthias Buchmeier/download

Ready-to-use dictionaries can be downloaded in the following formats:

ding (text-format)
dictd (Goldendict etc.)
dsl (ABBYY Lingvo)

  • Alternatively, you can download the dictionaries with the bash scripts below or by copying and pasting from the browser.
  • To use the scripts below, lynx and awk must be installed on your PC.

Download a single dictionary

Code:

#!/bin/bash
# print usage info (to stderr, so that it never ends up in the redirected dictionary file):
if [ $# -lt 2 ]
then
    echo "usage: gettextdictionary.sh SOURCE-ISO TARGET-ISO >dictionaryfile.ding" >&2
    echo "where SOURCE-ISO and TARGET-ISO are the ISO language codes of the source and target language respectively, e.g. \"es\" \"en\" for the Spanish-English dictionary" >&2
    exit 1
fi

iso=$1
iso2=$2
WIKIPATH="User:Matthias_Buchmeier/$iso-$iso2"

# the dictionary is spread over one wiki page per suffix (a to z, plus 0)
for letter in a b c d e f g h i j k l m n o p q r s t u v w x y z 0
do
    # dump the page as plain text, keep only the lines containing "::",
    # collapse runs of spaces and strip the leading space
    lynx -width=1000 -nolist -underscore -dump -assume_charset=utf-8 -display_charset=utf-8 "https://en.wiktionary.org/w/index.php?title=$WIKIPATH-$letter&printable=yes" |\
    awk '/::/ {gsub(/[ ]+/, " "); gsub(/^[ ]/, ""); print;}'
done
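
The awk filter at the end of the pipeline can be tried without network access. The following is a minimal sketch that runs the same filter on a made-up sample; the sample lines and the function name `filter_entries` are assumptions for illustration, not actual page content.

```shell
#!/bin/bash
# Hypothetical stand-in for a lynx text dump: two entry lines and one
# unrelated line that does not contain "::".
sample='   casa {f}   ::   house
   Navigation menu
   perro {m} ::  dog'

# Same filter as in the script above: keep lines containing "::",
# collapse runs of spaces and strip the single leading space.
filter_entries() {
    awk '/::/ { gsub(/[ ]+/, " "); gsub(/^[ ]/, ""); print }'
}

printf '%s\n' "$sample" | filter_entries
```

Only the two lines containing `::` should survive, each with single spaces between words.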

Download all dictionaries

Code:

#!/bin/bash

installdir="."
#installdir=/usr/share/trans

# test for existence of lynx and awk
for PROG in lynx awk
do
    command -v "$PROG" >/dev/null 2>&1 || { echo >&2 "Program $PROG is required but it's not installed.  Aborting."; exit 1; }
done

# download WIKIPATH TARGETFILE
# fetches all sub-pages of WIKIPATH and writes the dictionary lines to TARGETFILE
download() {
    echo "downloading $2"
    # start with an empty target file
    : >"$2"
    for letter in a b c d e f g h i j k l m n o p q r s t u v w x y z 0
    do
        lynx -width=1000 -nolist -underscore -dump -assume_charset=utf-8 -display_charset=utf-8 "https://en.wiktionary.org/w/index.php?title=$1-$letter&printable=yes" |\
        awk '/::/ {gsub(/[ ]+/, " "); gsub(/^[ ]/, ""); print;}' >>"$2"
    done
}

# dictionaries from English into other languages
for lang in es it pt fr nl de fi no sv cs hu pl ru ja arb cmn fa hi vi el he tr ko bg ro ca sh da
do
    WIKIPATH="User:Matthias_Buchmeier/en-$lang"
    TARGETPATH="$installdir/en-$lang-enwiktionary.txt"
    download "$WIKIPATH" "$TARGETPATH"
done

# dictionaries from other languages into English
for lang in es it fr fi pt
do
    WIKIPATH="User:Matthias_Buchmeier/$lang-en"
    TARGETPATH="$installdir/$lang-en-enwiktionary.txt"
    download "$WIKIPATH" "$TARGETPATH"
done

Authorlist Compilation

If you want to redistribute the text dictionaries, the Creative Commons Attribution-ShareAlike 3.0 Unported License requires the inclusion of:

  • c) a list of all authors. (Any list of authors may be filtered to exclude very small or irrelevant contributions.)

The following code can be used to download a list of the users of en.wiktionary, sorted by number of edits and excluding users below an edit-count threshold.

  • Required programs: bash, gawk, wget, tail and sort (from GNU coreutils)

Code:

#!/bin/bash
# generates a list of enwiktionary contributors, sorted by number of edits
# exclude contributors with fewer edits than this threshold:
EDITTHREASH=200
TEMPFILE=./users-unsorted.txt
TARGET=CREDITS
#APIFLAGS='&redirects&aulimit=500&auexcludegroup=bot'
# include all bots, as inactive bots will be included anyhow
APIFLAGS='&redirects&aulimit=500'

# first batch; userlistfilter.awk ends its output with a "NEXT<TAB>..." line
# that names the starting point of the next batch
wget --quiet "https://en.wiktionary.org/w/api.php?action=query&list=allusers&format=xml&auprop=editcount&auwitheditsonly$APIFLAGS" -O - 2>>WgetErr.txt \
| gawk -f userlistfilter.awk -v THREASH="$EDITTHREASH" >"$TEMPFILE"
NEXT=$(tail -n 1 "$TEMPFILE" | gawk 'BEGIN {FS="\t";} /^NEXT/ {print $2;}')

echo "$NEXT"

# fetch further batches until the filter reports the last one
while [ "$NEXT" != "THELASTUSERLIST" ]
do
    wget --quiet "https://en.wiktionary.org/w/api.php?action=query&list=allusers&format=xml&auprop=editcount&auwitheditsonly&aufrom=$NEXT$APIFLAGS" -O - 2>>WgetErr.txt \
    | gawk -f userlistfilter.awk -v THREASH="$EDITTHREASH" >>"$TEMPFILE"
    NEXT=$(tail -n 1 "$TEMPFILE" | gawk 'BEGIN {FS="\t";} /^NEXT/ {print $2;}')
    echo "$NEXT"
done

# sort by the zero-padded edit count (descending) and keep only the user names
sort -r "$TEMPFILE" | gawk 'BEGIN {FS="\t";} /^[0-9]/ {print $2;}' >"$TARGET"
rm "$TEMPFILE"
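
The final `sort -r` works as a numeric sort only because the awk filter writes every edit count zero-padded to eight digits. The following is a minimal sketch of that idea on made-up data; the user names, the counts and the function name `rank_users` are assumptions for illustration.

```shell
#!/bin/bash
# Reads "COUNT<TAB>NAME" lines and prints the names, most edits first.
rank_users() {
    # Pad the count to eight digits so that a plain reverse text sort
    # orders the lines by edit count, then drop the count column.
    awk -F'\t' '{ printf "%08d\t%s\n", $1, $2 }' | LC_ALL=C sort -r | cut -f2
}

# Hypothetical users and edit counts.
printf '950\tAlice\n12000\tBob\n300\tCarol\n' | rank_users
```

Without the padding, a text sort would place "950" ahead of "12000", because it compares the strings character by character.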

AWK-script (must be saved to userlistfilter.awk)

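# Filter for the XML output of the allusers API query.
# Prints "zero-padded edit count<TAB>user name" for each user with enough edits,
# followed by a final "NEXT<TAB>..." line naming the start of the next batch.
BEGIN {
# split the XML into one record per element
RS="><";
Count_Threash=1;
if(THREASH!="") Count_Threash=THREASH;
}

# user element: extract the name and the edit count
/u userid[=]["]/ {
ID=gensub(/(^.*name[=]["])(.*)(["] editcount.*$)/, "\\2", "g", $0); 
COUNT=gensub(/(^.*editcount[=]["])([0-9]*)(["]).*$/, "\\2", "g", $0); 
if(1.0*COUNT>=Count_Threash) {printf "%08d\t", COUNT; print ID;}
}

# continuation element: remember where the next query has to start
/continue aufrom[=]["]/ {
NEXT=gensub(/(^.*continue aufrom[=]["])(.*)(["] continue.*$)/, "\\2", "g", $0);
# an ampersand has to be percent-encoded as %26
gsub(/[&]/, "%26", NEXT);
}

END {
# no continuation element seen: this was the last batch
if(NEXT=="") { print "NEXT\tTHELASTUSERLIST"; exit;}
print "NEXT\t"NEXT;
}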
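
The bash script and the awk filter communicate through the last line of the filter output: a `NEXT` marker followed by a tab and either the next starting name or `THELASTUSERLIST`. The following is a minimal sketch of how the script reads that marker; the file contents and the function name `next_token` are assumptions for illustration, and plain awk is used in place of gawk.

```shell
#!/bin/bash
# Prints the continuation value stored in the last line of the given file.
next_token() {
    tail -n 1 "$1" | awk -F'\t' '/^NEXT/ { print $2 }'
}

tmp=$(mktemp)

# Hypothetical filter output: one user line, then a marker for another batch.
printf '00000950\tAlice\nNEXT\tBob%%26Co\n' >"$tmp"
next_token "$tmp"

# Hypothetical filter output for the final batch.
printf '00000300\tCarol\nNEXT\tTHELASTUSERLIST\n' >"$tmp"
next_token "$tmp"

rm -f "$tmp"
```

The download loop keeps requesting batches until this value equals `THELASTUSERLIST`.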

Pageviews