User:Interwicket/code/mbwa

#!/usr/bin/python
# -*- coding: utf-8  -*-
# wikipath en wiktionary User:Interwicket/code/mbwa


"""
This bot updates iwiki links between wiktionaries

It runs as a set of threads, each doing part of the process. The first group of threads generates
tasks (instances of class Task) from the wiki indexes, from deletions against those indexes, from
recent changes, and as null tasks. These are then picked up by the main thread and distributed to
7 (identical) "hunter" threads, which read the entries for the title. The pages are then passed to
a group of 4 (again identical) "replink" threads, which write the changes (in module reciprocal.py).

Mbwa maintains a local database of all of the titles in all of the 170 wikts, so it knows which
entries will need links. The database is built and updated automatically; it has no trouble
starting from scratch. It is divided into 26 files, a to z by title; titles outside a-z are
reduced modulo 26 on the first letter to a-z. This is done to avoid having to rebuild the entire
index when bsddb (eventually) corrupts one file; that file can simply be deleted and the process
restarted.

Add tasks: this thread reads the title index and language links from each wikt with the MW API. It
queues a task for each apparent discrepancy between an entry and the index. In the case where the
title is "new", it adds the links to the index and does not queue a task; this is essential to
startup (or to re-building one index file): if links are needed, they will be found while reading
the index of another wikt (or the same one on the next pass). This thread also generates sets of
titles queued to the next thread (delete). The indexes are read with adjusted delays in between,
until the interval settles at about once per week per wikt.

Deletes: this compares each set of titles from add tasks to the indexes in the "inverse" direction,
looking for titles and links in the index that are not found in the live wikt. In each case it
generates a delete task, added to the main queue. This, in combination with the tasks queued to add
links, ensures that the links will be brought up to date regardless of missed RC entries, page
moves, reversions, and any other odd events.

Recent changes: reads RC (from the MW API) from each wikt, for new entries more than one hour and
less than two days old. It treats bot entries (including those not flagged as "bot", but where the
username ends with "...bot") as a lower priority when queuing them. This helps give RC entries
created by humans priority. The changes are read at adapted intervals: wikts that show changes are
read increasingly frequently, those that do not are read less often, up to a maximum interval of
about one day. The overall rate of API requests is thus kept to a minimum, while still being
responsive in finding changes.

Null: this thread generates a null task at intervals, to keep the main thread spinning.

Main thread: initialization, start other threads; then for each task: read the "primary" entry
referred to by the task object (in the case of a delete, see if another can be found), check the
links, and if they are not complete, set up the task to be passed to a hunter thread.

Hunter: given the set of links from the page and the index, and others found along the way, read
each page to be (potentially) updated for this title. Then queue all the pages with complete sets
of links to the replink threads. All threads use a shared tick-tock timer to limit the page
read rate.

Replink: (in reciprocal.py) replace the links in the page and write if needed; this uses another
tick-tock to limit the page update rate.
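
In outline (queue names as used in the code below):

    addtasks / deltasks / recent / nulltask  -->  tasq  (priority queue of Task objects)
    main thread:          tasq       -->  huntq
    hunter threads (7):   huntq      -->  toreplink
    replink threads (4):  toreplink  -->  page edits (in reciprocal.py)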

Name:

"MBWA" can be given two meanings; the simplest is the acronym for Management By Wandering Around,
that is, the mbwa program just looks for things worth doing, not trying to do one particular list
in a particular order. However, it will -- eventually -- get to everything.

The other explanation is more complicated: in the 1970s I worked on a VLSI design system that
produced NMC tapes for a photoplotter, which in turn produced the chrome and glass reticles used
to expose the pattern on silicon wafers.

The pattern tapes had the flashes (each a precisely positioned rectangle) sorted into the optimum
order for the photoplotter, to minimize the amount of motion between flashes. (As optimized runs
would still take 36 or more hours, this was very important!) The optimum order for typical arrays
such as memory cells was boustrophedonic -- up one row and down the next -- as the ox ploughs.

In jest, we referred to an unsorted order as "urocanic" -- as the dog pisses.

Since this program appears random in behaviour, and to a great extent is, that would seem to apply,
although it doesn't reduce the performance.

So mbwa: Kiswahili for "dog".

RLU 10.2.9
"""

import wikipedia
import xmlreader
import sys
import socket
import re
from time import *
from random import random, expovariate, choice
from mwapi import getwikitext, getedit, readapi, getticktock
from reciprocal import addrci, flws, getflstatus, replink, plock, updstatus, toreplink, setreptick
from iwlinks import getiwlinks

# borrow global:
from config import usernames

def srep(s):
    return repr(u''+s)[2:-1]

def reval(s):
    # tricky as repr uses either ' or ", but uses ' if both are present, escaping it
    if "'" in s and '"' not in s: return eval('u"' + s + '"')
    else: return eval("u'" + s + "'")

def safe(s): return srep(s)

reblank = re.compile(r'\[\[[A-Za-z-]+\s*:[^\]]*\]\]')
def isblank(t, p):
    # like isEmpty in wikipedia, but much better and faster, reduces to (almost) identical
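    # e.g. a page consisting only of interwiki links reduces to (almost) nothing below, so the
    # exact framework test decides; anything with more than a few characters of other content
    # is reported non-blank here without a framework call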
    if len(t) > 20 and '[[' not in t[:20]: return False
    # which is the 99% case, except for images atop, (and pl.wikt ;-)
    if len(reblank.sub('',t).strip('\n ')) > 4: return False
    return p.isEmpty() # resort to exact framework test, so as not to war with others

respace = re.compile(u'[ _\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029' +
                     u'\u202F\u205F\u3000]+')

def fixtitle(t):
    # fix a page title the way the server does to make DB keys, as pybot framework is out of date
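    # e.g. fixtitle(u'foo_bar') == u'foo bar', fixtitle(u' foo\u00A0bar ') == u'foo bar'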
    # code from server:

    # $dbkey = preg_replace( '/\xE2\x80[\x8E\x8F\xAA-\xAE]/S', '', $dbkey );
    #   (is BIDI, not done yet)
    # $dbkey = preg_replace( '/[ _\xA0\x{1680}\x{180E}\x{2000}-\x{200A}\x{2028}\x{2029}
    #    \x{202F}\x{205F}\x{3000}]+/u', '_', $dbkey );

    t = respace.sub(' ', t).strip(' ')
    return t

import shelve

Lcode = { }
Exists = set()
site = { }
naps = { }

def now(): return int(clock())

Quit = False

import threading, Queue
tasq = Queue.PriorityQueue()
# 35K max to keep process memory < ~100MB
# 70K max to handle state of affairs (17.3.9), limit is soft in code below

# hunter queue, just big enough to distribute tasks to 7 or so hunters, soft limit
huntq = Queue.Queue()

# delete task queue, project sets for deltasks to check
delq = Queue.Queue()

# a single index file is 650MB (20.2.9) and if it gets corrupted, we start over
# so we use 26 of them, -a through -z, each title being in the file that is = mod 26 to the first char
# (note this will be different for some chars in a wide build, we just use the first "UTF-16" word)
# a corrupted file can just be deleted and will be re-built

mifs = range(0, 26)  # list, to be shelves

# modulus ops on titles; note on a "narrow" build, this only uses the top half of surrogate pairs
def mif(s): return mifs[ord(s[0])%26]
def lkey(s): return 'hijklmnopqrstuvwxyzabcdefg'[ord(s[0])%26]
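# e.g. ord('a') % 26 == 19 and 'hijklmnopqrstuvwxyzabcdefg'[19] == 'a', so lkey('apple') is 'a'
# and mif('apple') is the shelf opened (in main) as mbwa/mbwa-index-a; titles starting outside
# a-z are folded into the same 26 files by the same modulus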

# Mi = shelve.open("mbwa-index")
milock = threading.RLock()
# (could use 26 locks, but this isn't going to matter that much?)

# mbwa index is keyed with srep(title), value is the menc() encoding of the links and redirs lists

def menc(ul, ur): return "%s,%s" % (" ".join(ul), " ".join(ur))
def mdec(s):
    sp = s.split(',')
    return sp[0].split(), sp[1].split()
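
# e.g. menc(['de', 'fr'], ['en']) == 'de fr,en' and mdec('de fr,en') == (['de', 'fr'], ['en'])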

# rather boring sequence of set routines:

def miget(title):
  # get the links and redirs list as far as we know
  with milock:

    Mi = mif(title)

    if srep(title) in Mi:
        ul, ur = mdec(Mi[srep(title)])
        return ul, ur

    return [], []

def miadd(code, title, others = []):
  # add a link from the wikt indexes, as well as others when first found
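  # e.g. the first sighting of a title, on fr.wikt with links to de and en, gives
  #   miadd('fr', title, ['de', 'en'])  ->  stored value menc(['de', 'en', 'fr'], [])
  # later calls only add the sighting wikt's code; the others list is ignored once ul exists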
  with milock:

    Mi = mif(title)

    if srep(title) in Mi:
        ul, ur = mdec(Mi[srep(title)])
    else:
        ul = []
        ur = []

    if code in ul: return
    if code in ur:
        ur.remove(code)

    if not ul:
        us = set(others)
        us.add(code)
        us -= set(ur)
        ul = sorted(us)
    else:
        ul.append(code)

    Mi[srep(title)] = menc(ul, ur)

def midel(code, title):
  # delete a link when we find no page
  with milock:

    Mi = mif(title)

    if srep(title) in Mi:
        ul, ur = mdec(Mi[srep(title)])
    else:
        return

    if code in ul: ul.remove(code)
    if code in ur: ur.remove(code)

    if not ul and not ur: del Mi[srep(title)]
    else: Mi[srep(title)] = menc(ul, ur)
        
def mired(code, title):
  # found a redirect, move to redirs list
  with milock:

    Mi = mif(title)

    if srep(title) in Mi:
        ul, ur = mdec(Mi[srep(title)])
    else:
        ul = []
        ur = []

    if code in ur and code not in ul: return

    if code in ul:
        ul.remove(code)

    if code not in ur:
        ur.append(code)

    Mi[srep(title)] = menc(ul, ur)

def miset(title, ul, ur):
  # at completing an entry (title), record links and redirs
  with milock:

    Mi = mif(title)

    Mi[srep(title)] = menc(ul, ur)
    Mi.sync()

def miall(tix, nap = 1.0):
    # get the index keys for one of the files (title is a-z)
    # takes a non-trivial amount of time, but less than a minute
    # then return title, links, redirs for each
    # caller must take care not to block as we have lock held
    # 200 on each lock, as locking on each entry is ridiculous amounts of CPU!

    Mi = mif(tix)
    with milock: mk = Mi.keys()

    klim = len(mk)
    k = 0
    while k < klim:
        i = 0
        with milock:
            while k+i < klim and i < 200:
                if mk[k+i] in Mi: # (may have gone away ;-)
                    ul, ur = mdec(Mi[mk[k+i]])
                    yield reval(mk[k+i]), ul, ur
                i += 1
        k += 200
        sleep(nap) # while not holding lock

    # done

# read page titles and links from wikts, return apparent mismatches

from getlinks import getlinks

def livelinks(home):

    redirs = flws[home].redirs

    # sets of titles present, for delete scan
    pset = { }
    for k in 'abcdefghijklmnopqrstuvwxyz': pset[k] = set()

    # read page title and links from the wikt, compare to our index

    for title, links, redflag, bad in getlinks(flws[home].site, plock = plock):

        if Quit: break

        pset[lkey(title)].add(title)

        # if a redirect, add as such, continue
        if redflag:
            mired(home, title)
            continue

        ul, ur = miget(title)

        # ll is set, validate ...
        ll = set()
        for lc in links:
            lc = str(lc) # not unicode (at Py2.6)
            if lc not in flws:
                # odd case, WMF server thinks it is a language (user copied links from 'pedias?)
                # remove these: (will set lockedwikt in flw.__init__)
                flws[lc].deletecode = True
            if flws[lc].lockedwikt and not flws[lc].deletecode: continue
            # we want to edit to delete in second case
            ll.add(lc)

        # if there are no links at all (not even home!) then this is a new title (to us)
        # (there may have been redirects) don't do it on this pass (if it is itself okay, not bad)
        if not ul and not bad: 
            miadd(home, title, sorted(ll))
            continue

        # make sure this wikt is present for title (it is not a redirect)
        if home not in ul:
            miadd(home, title)
            ul, ur = miget(title)

        if redirs: ul += ur
    
        # compare links to ul, should match
        # first add home to ll, then it should be identical
        ll.add(home)

        # if not redirs, but some present, is okay (at this point):
        if not redirs and ur:
            for lc in ur: ll.discard(lc)
            # (also no point in trying to read them in hunt ;-)

        # similar but different case for nolink, e.g. pl->ru
        for lc in flws[home].nolink:
            if lc in ul: ll.add(lc) # pretend present for comparison

        # if apparent mismatch, or bad link(s) in the entry
        if sorted(ll) != sorted(ul) or bad:

            lcs = set(ul)
            lcs.discard(home)

            lnotu = [x for x in ll if x not in ul]
            unotl = [x for x in ul if x not in ll]

            # with plock: print "    in LL, not in UNION:", lnotu
            # with plock: print "    in UNION, not in LL:", unotl

            # some difference, so nk always > 1
            yield title, lcs, ur, len(lnotu) + len(unotl) + 1, bad

        # else: with plock: print "(%s matches)" % repr(title)

    for k in 'abcdefghijklmnopqrstuvwxyz':
        delq.put( (home, k, pset[k]) ) # start scans for deletions in this wikt

# extraordinarily silly omission from Python
def sign(x):
    if x < 0: return -1
    if x > 0: return 1
    return 0

kloset = clock()
def klo(): return (clock() - kloset) / 1000.0

# Task class, with comparison key r controlled for time
# also slots optimization, (yes, tacky, but we gen up a lot of these)

class Task:
    __slots__ = [ 'home', 'title', 'r', 'nk', 'src', 'force', 'page', 'onq' ]
    def __init__(self, home='', title='', r=0, nk=0, src='', force=False, onq = 0):
        self.home = home
        self.title = title
        self.r = r or expovariate(.01)
        self.nk = nk
        self.src = src
        self.force = force
        self.onq = onq or time()
        self.page = None # added in main thread

        # now set r to move forward, run queue, not bury things forever
        while self.r > 4200.0: self.r -= 700.0
        self.r += klo()

    def __cmp__(self, other):
        return sign(self.r - other.r)

# return tasks in random but prioritized order
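# (tasq is a PriorityQueue ordered by Task.__cmp__ above: r is an exponential random offset
# plus the slowly advancing klo() clock, and the wrap at 4200 folds very large offsets back
# down, so the order is randomized but nothing stays buried on the queue forever)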

qpw = { }  # queue per wikt, to implement max and to-be-done
qpwlock = threading.Lock()

# keep track of least recently used (done) wikts, persists on disk
lru = shelve.open("mbwa-lru-list")

# last seen, a "timeout set": a set whose elements magically disappear after a time

from weakref import WeakValueDictionary
from heapq import heappush, heappop

class tmo(float): pass
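# plain floats cannot be weakly referenced, but instances of this trivial subclass can; the
# heap in lastseen holds the only strong reference to each timestamp, so popping an expired
# timestamp off the heap also drops its entry from the WeakValueDictionary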

class lastseen():

    def __init__(s, timeout):
        s.timeout = timeout
        s.wdict = WeakValueDictionary()
        s.theap = []

    def add(s, key):
        t = tmo(clock())
        s.wdict[key] = t
        heappush(s.theap, t)

    def __contains__(s, key):
        while s.theap and s.theap[0] < clock() - s.timeout: heappop(s.theap)
        return key in s.wdict
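
# usage (see recent() below): seen = lastseen(4 * 86400); seen.add('en:word');
# then 'en:word' in seen stays True until the timeout has elapsed (measured with clock())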

def addtasks():

    np = 0.0

    # entries seen already, use Weak-Val-Dict directly, on titles->tasks
    # title will be "in" seen if still on queue or hunt
    seen = WeakValueDictionary()

    # init lru:
    for lc in Exists:
        lc = str(lc) # not unicode
        if flws[lc].lockedwikt: continue
        if lc not in lru: lru[lc] = 0.0
    # cleanup:
    for lc in lru.keys():
        if flws[lc].lockedwikt: del lru[lc]
    lrulen = len(lru)

    # one off 22.5.10
    # lru['ml'] = 0

    # for each, read everything, set priorities

    mint = 20

    while not Quit:

        np += 1.0/lrulen    # "pass" number

        # find least recently done wikt:

        home = sorted(lru.keys())[0] # or any other given code, not in unicode, but lru key
        old = lru[home]
        for lc in lru:
            if lru[lc] < old:
                old = lru[lc]
                home = lc

        # if less than a week has passed, sleep for a while, and then go ahead
        # 70 min * 150 wikts is about a week (168 hours), is about the target
        if lru[home] + (168 * 3600) > time():
            mint = min(mint+2, 90)
        else:
            mint = max(mint-5, 20)

        for i in range(mint, 0, -1):
            with plock: print "(add tasks: sleeping, %s next in %d)" % (home, i)
            sleep(60)
            if Quit: break
        if Quit: break

        # record it now, so if we fail (or are aborted) we go on to the next one
        lt = strftime("%d %B %Y %H:%M:%S", gmtime(lru[home])).lstrip('0')
        if not lru[home]: lt = '[never]'
        lru[home] = time()
        lru.sync()

        # queue all within process memory reason
        qmax = (70000 - tasq.qsize()) / 2
        with qpwlock: 
            if home not in qpw: qpw[home] = 0
            qmax += qpw[home]

        with plock: print "(reading links from %s.wikt, last done %s, qmax %d)" % (home, lt, qmax)
        # skip codes we don't want
        # we only want "home" wikts with bot or noflag status

        flw = flws[home]
        ponly = False
        status = getflstatus(flw, nowrite = True)
        if status not in ["bot", "noflag", "globalbot", "test", "blocked"]:
            # we don't want this one at all
            with plock: print "(%s status is %s, not reading at all)" % (home, flw.status)
            continue
        if status not in ["bot", "noflag", "globalbot"]:
            with plock: print "(%s status is %s, not queueing from links)" % (home, flw.status)
            ponly = True
                   
        tf = 0
        qt = 0

        for title, lcs, urs, nk, bad in livelinks(home):

            if Quit: break

            if ponly: # not bot or noflag, we are just counting them
                tf += 1
                continue

            # if it is known as a redirect, skip it (odd case)
            if home in urs: continue

            # clip main page here if we can
            if title.lower() == 'main page': continue
            if title == flw.mainpage: continue

            if title in seen: continue # queued already on this run, counted in qpw

            tf += 1 # found a (new) task

            if qpw[home] > qmax: continue # doesn't make the cut

            with qpwlock: qpw[home] += 1

            r = expovariate(.001)
            t = Task(home=home, title=title, r=r, nk=nk, src='idx', force=bad)
            seen[title] = t

            # if lots on queue, no hurry ...
            if tasq.qsize() > 500: sleep(1)

            tasq.put(t)
            qt += 1

        if Quit: break
        with plock:
            print "(found %d tasks for %s, queued %d, queue size %d)" % (tf, home, qt, tasq.qsize())
        if flw.status in ["bot", "noflag", "globalbot", "test"]:
            flw.tbd = tf - qt + qpw[home] # total less queued on this pass plus on-queue (!)
            updstatus(flw)

        sleep(70) # rest between indexes. Is a huge effort, reading them ... (:-)
        # (real reason is to keep process from spinning when little to do, main timer above)

    # end of trueloop
    with plock: print "(add tasks thread ending)"

# now recent changes set

def recent(home = 'en'):

    # entries seen already, timeout set, keep for > 3 days (use 4)
    seen = lastseen(4 * 86400)

    # set up list of wikt codes to look at

    qtime = { }
    rcstart = { }
    for lc in Exists:
         if flws[lc].lockedwikt: continue
         site[lc] = flws[lc].site
         naps[lc] = 60 * choice(range(3, 91)) # scatter 3 to 90 minutes
         if lc == home: naps[lc] = 0 # no wait for home wikt the first time
         qtime[lc] = now() + naps[lc] # initial time
         rcstart[lc] = ''

    ny = 0

    rcex = re.compile(r'<rc[^>]*title="(.+?)"[^>]*>')
    rccont = re.compile(r'rcstart="(.+?)"')
    rcisbot = re.compile(r'<rc[^>]*user="[^>]*bot"[^>]*>', re.I)

    while not Quit:

        # sleep until next one
        nextq = now() + 1000000
        nextlc = ''
        for lc in qtime:
            if qtime[lc] < nextq:
                nextq = qtime[lc]
                nextlc = lc
        st = nextq - now()
        # if st > 90:
        #    with plock: print "(%d, sleeping %d minutes, %s next)" % (now(), (st+29)/60, nextlc)
        if st > 0:
            sleep(st)
        if st < -120:
            with plock: print "(rc %d minutes behind)" % (-(st-29)/60)
        lc = nextlc

        if Quit: break

        # for mbwa, only read rc from non-troublesome wikts
        # this saves the bother of looking at closed wikts too
        flw = flws[lc]
        if getflstatus(flw, nowrite = True) not in ["bot", "noflag", "globalbot", "test"]:
            with plock: print "(%s status is %s, not reading rc)" % (lc, flw.status)
            qtime[lc] = now() + 86400  # look again tomorrow
            continue

        # read recentchanges, new entries, namespace 0, from site:

        if True: # [indent]

            # with plock: print "(%d, reading from %s.wikt)" % (now(), lc)
            nf = 0

            # set parameters

            # from a little while ago (8 hours)
            if not rcstart[lc]:
                rcstart[lc] = '&rcstart=' + strftime('%Y-%m-%dT%H:%M:%SZ', gmtime(time() - 8*3600))

            # up to one hour ago
            rcend = '&rcend=' + strftime('%Y-%m-%dT%H:%M:%SZ', gmtime(time() - 3600))

            # slow start, don't need to pick up too quickly
            rclimit = "&rclimit=%d" % min(10 + ny/20, 200)

            # with plock: print "(options " + rcend + rcshow + rclimit + ")"

            try:
                rct = readapi(flw.site,
                     "action=query&list=recentchanges&format=xml&rcprop=title|user|flags&rcdir=newer" +
                     "&rctype=new&rcnamespace=0"+rcend+rcstart[lc]+rclimit,
                     plock = plock)
            except wikipedia.NoPage:
                with plock: print "can't get recentchanges from %s.wikt" % lc
                # rct = ''
                # sleep(30)
                qtime[lc] = now() + 700  # do other things for a bit
                continue
            except KeyError:
                # local bogosity
                with plock: print "keyerror"
                sleep(20)
                continue

            # (They've borked the API by making gratuitous changes, we can't check
            #  to see if we have an empty "recentchanges" element, because it isn't
            #  always present now! Look for an attribute too. Sigh.)
            if "<recentchanges />"  in rct or 'recentchanges=""' in rct:
                pass
            elif '</recentchanges>' not in rct:
                with plock: print "some bad return from recentchanges, end tag not found"
                with plock: print safe(rct)
                # rct = ''
                sleep(30)
                qtime[lc] = now() + 300  # do other things for a bit
                continue

            # continue parameter:

            mo = rccont.search(rct)
            if mo:
                rcstart[lc] = "&rcstart=" + mo.group(1)
            else: # we are up to date, set to one hour + 100 sec ago
                rcstart[lc] = '&rcstart=' + strftime('%Y-%m-%dT%H:%M:%SZ', gmtime(time() - 3700))

            found = False
            for mo in rcex.finditer(rct):

                if Quit: break

                title = mo.group(1)

                # unescape, API uses (e.g.) #039 for single '
                title = wikipedia.html2unicode(title)
                title = title.replace('_', ' ')
                if ':' in title: continue
                if not title: continue

                isbot = False
                if 'bot=""' in mo.group(0): isbot = True
                if rcisbot.match(mo.group(0)): isbot = True

                if lc + ':' + title not in seen:

                    lcs, urs = miget(title)
                    # new to us or not?
                    if lc in lcs:
                        lcs.remove(lc) # can happen on restarts (or entry re-created)
                        isnew = False
                    else:
                        isnew = True # new entry created with iwikis, do it anyway
                    nk = len(lcs) + len(urs) + 1

                    if nk == 1: continue # unique title

                    seen.add(lc + ':' + title)

                    if not isbot:
                        t = Task(home=lc, title=title, nk=nk, src='rc', force=isnew)
                    else:
                        t = Task(home=lc, title=title, nk=nk, src='bot', r = expovariate(.0014))

                    tasq.put(t)
                    ny += 1
                    nf += 1
                    found = True

            if found:
                naps[lc] /= 2
                naps[lc] = max(naps[lc], 300) # five minutes
            else:
                mn = naps[lc]/300 # one-fifth, in minutes
                naps[lc] += 60 * choice(range(5, 10 + mn))
                # add 5-10 minutes or longer if we don't find anything
                maxnap = 60 * choice(range(1400, 1500)) # around 24 hours
                naps[lc] = min(naps[lc], maxnap)

            qtime[lc] = now() + naps[lc]
            with plock: print "(rc found %d in %s, next in %d minutes)" % (nf, lc, (naps[lc]+29)/60)


            """
            if naps[lc] > 90:
                with plock: 
            elif naps[lc] > 30:
                with plock: print "(rc found %d in %s, next in %d seconds)" % (nf, lc, naps[lc])
            else:
                with plock: print "(rc found %d in %s, next immediately)" % (nf, lc)
            """

    with plock: print "(recent changes thread ending)"

def deltasks():

  # incoming queue is sets of titles, already sorted by modulus key
  # this is a bit more complicated than just a set for the whole wikt, but allows
  # us to release memory for each set as we go

  psets = { }    # dict by key letter and then lc of sets queued to us
  found = { }    # dict by lc of number found
  tbc = 0        # total titles to be checked

  while True:

    # scan our whole local db, looking for entries that list a title, when the title is not in wikt

    for tix in 'abcdefghijklmnopqrstuvwxyz':

        # look for new task sets, block as we must have one, or not enough to be worth it yet
        # pick up whatever is available now, add to our little structure

        while delq.qsize() > 0 or not psets or tbc < 100000:
            lc, k, pset = delq.get()
            if k not in psets: psets[k] = { }
            if lc in psets[k]: tbc -= len(psets[k][lc])
            psets[k][lc] = pset # if we have somehow wrapped all the way 'round use new one!
            found[lc] = 0       # make sure it exists
            tbc += len(pset)

        # any for this letter?
        if tix not in psets: continue
        ptix = psets[tix]

        with plock:
             print "(starting delete scan for %s/%s, tbc %d)" % (','.join(ptix.keys()), tix, tbc)

        for lc in found: found[lc] = 0

        # read index file
        for t, ul, ur in miall(tix, nap = 1.0): # pole, pole, hakuna matata!

            # NOTE milock held here by miall(), don't take other locks or block
            #      we do take tasq sync lock implicitly, and release it
            if Quit: return # should unwrap everything?

            for lc in ptix:
                if t not in ptix[lc] and (lc in ul or lc in ur):
                    if t != fixtitle(t):
                        # got a bad one somewhere, delete from db now
                        midel(lc, t)
                        continue  # no need to check the fixed title?

                    """ better handled by 'exists' for 'del' below? more general case?
                    if lc == 'ml' and u'\u0d4d\u200d' in t:
                        # "bad" titles from before forced 5.1 "normalization", don't do delete
                        # op, it will add bad iwikis
                        # this prevents thrashing, but doesn't solve problem
                        continue
                    """

                    # queue up a delete task
                    task = Task(home=lc, title=t, src='del')
                    tasq.put(task)
                    found[lc] += 1
                    break # don't look at other wikts, one task is enough

        for lc in ptix:
            if found[lc]:
                with plock: print "(delete scan for %s/%s, %d found)" % (lc, tix, found[lc])
            tbc -= len(ptix[lc])
        ptix = None
        del psets[tix] # done with all for letter key, discard

    # end of True-loop
  with plock: print "(delete thread ending)"

def nulltask():

    wasrt = 10.0

    # keep main task queue and thread slithy et lubriceaux

    while not Quit:
        sleep( min(tasq.qsize() + 70, 350) )
        tasq.put( Task(src='null') )

        # adjust rate (mostly this is for fun, though it is useful in spreading load ;-)
        # below 200, is 7 sec, above 2700 is 2 seconds, at 7000 1 second, at 10K no reptick

        rt = min(max((3700.0-tasq.qsize())/500.0, 2.0), 7.0)

        # and corrections outside range: (cover the range, so no restarts, this is the serious advantage)
        if tasq.qsize() > 5000: rt = 1.5
        if tasq.qsize() > 7000: rt = 1.0
        if tasq.qsize() > 10000: rt = 0.0
        if tasq.qsize() < 10: rt = 10.0

        if int(wasrt*10) != int(rt*10):
            with plock: print "(replink ticktock was %.3f, now %.3f)" % (wasrt, rt)
            setreptick(rt)
            wasrt = rt

    with plock: print "(null thread exiting)"

def main():

    socket.setdefaulttimeout(70)

    with plock: flws['en'].site.forceLogin()

    # setup basics

    for c in 'hijklmnopqrstuvwxyzabcdefg':
        mifs[ord(c)%26] = shelve.open('mbwa/mbwa-index-' + c, protocol = 2)

    enw = wikipedia.getSite(code = "en", fam = "wiktionary")

    # make sure we have an flw for everything claimed to be in family (including stops)
    for code in flws['en'].site.family.langs: foo = flws[code]

    # get active wikt list
    # minus crap. Tokipona? what are they thinking? Klingon? ;-) deleted ISO code
    # se has no wiktionary (not even closed); 'as' is locked (but not shown locked in table?)
    Lstops = ['tokipona', 'tlh', 'sh', 'se', 'as']

    sitematrix = readapi(enw, "action=sitematrix&format=xml")

    rematrix = re.compile(r'//([a-z-]+)\.wiktionary')

    sms = set()
    for code in rematrix.findall(sitematrix):
        sms.add(code)
        # print "found code", code, len(sms)
        if code in Lstops: continue
        Exists.add(code)
        foo = flws[code]
        # see if we have a login in user config, else pretend we do
        # has to be done before any call, or login status gets confused!
        if code not in usernames['wiktionary']:
            usernames['wiktionary'][code] = "Interwicket"

    # set delete for anything not in matrix:
    for lc in flws:
        if lc not in sms: flws[lc].deletecode = True
 
    with plock: print "found %d active wikts" % len(Exists)
    if len(Exists) < 150: return

    for lc in Exists:
         site[lc] = wikipedia.getSite(lc, "wiktionary")
         naps[lc] = 0 # nil, might be referenced by hunt()

    with plock: print "starting ..."

    # start task generation threads, then yield queue entries:

    tt = threading.Thread(target=addtasks)
    tt.daemon = True # kill silently on exit (:-)
    tt.name = 'get link tasks'
    tt.start()

    rt = threading.Thread(target=recent)
    rt.daemon = True # kill silently on exit (:-)
    rt.name = 'get recent changes'
    rt.start()

    dt = threading.Thread(target=deltasks)
    dt.daemon = True # kill silently on exit (:-)
    dt.name = 'delete scan'
    dt.start()

    nt = threading.Thread(target=nulltask)
    nt.daemon = True # kill silently on exit (:-)
    nt.name = 'null task generator'
    nt.start()

    # now "hunter tasks"

    for i in range(1, 8):
        ht = threading.Thread(target=hunter)
        ht.daemon = True
        ht.name = 'hunter %d' % i
        ht.start()

    nt = 0

    while True:
        task = tasq.get()

        if task.src == 'null':
            with plock: 
                print '(null points r %.4f on queue %.2f seconds clock %.1f queue %d ' \
                      'hunt %d replink %d tick tock %.1f)' \
                      % (task.r, time() - task.onq, clock() - kloset, tasq.qsize(),
                         huntq.qsize(), toreplink.qsize(), getticktock())
            continue

        # queue limit from/for addtasks (;-)
        if task.src == 'idx': 
            with qpwlock: qpw[task.home] -= 1

        nt += 1

        # Task:
        with plock: print nt, '('+task.src+')', task.home, srep(task.title), \
                          "links", task.nk, "random", "%.4f"%(task.r), "queue", tasq.qsize()

        # locals, and coerce types
        home = task.home
        title = task.title
        ul, ur = miget(title)
        lcs = set(ul)
        urs = set(ur)
        lcs.discard(home)
        urs.discard(home)

        mysite = wikipedia.getSite(home, 'wiktionary')
        page = wikipedia.Page(mysite, fixtitle(title))
        task.page = page
        title = task.title = page.title()

        if ':' in title: continue # redundant, but eh?
        if title.lower() == 'main page': continue
        if not title: continue

        # with plock: print "%s:%s" % (home, srep(title))

        # structure of code here is leftover from source (;-)
        tag = True
        if tag:

            # ... pick up current version

            try:
                # text = page.get()
                text = getwikitext(page, plock = plock)
                oldtext = text
                if isblank(text, page):
                    # we don't want to update other entries, but treat this as missing
                    # we will look at it again every few days, it may then have content
                    with plock: print "    ... page is effectively blank"
                    midel(home, title)
                    text = ''
            except wikipedia.NoPage:
                with plock: print "    ... %s not in %s.wikt" % (safe(page.title()), safe(home))
                midel(home, title)
                # if task.src == 'del' and lcs:
                if lcs: # hmmm...
                    # others?
                    home = lcs.pop()
                    if flws[home].status in ['bot', 'globalbot', 'test', 'noflag']:
                        # requeue to ourselves: (can happen more than once)
                        task.home = home
                        task.src = 'delrq' # hmmm...
                        tasq.put(task)
                text = ''
            except wikipedia.IsRedirectPage:
                with plock: print "    ... redirect page"
                mired(home, title)
                text = ''
            except KeyError:
                # annoying local error, from crappy framework code
                with plock: print "KeyError"
                sleep(20)
                continue
            except Exception, e:
                with plock: print "unknown exception from getwikitext", repr(e)
                sleep(30)
                continue

            if not text: continue

            # if case was delete, and exists, we are done
            # this covers the Malayalam (ml) Unicode 5.1 force case, page appears to exist
            if task.src == 'del':
                with plock: print "    ...", srep(title), "exists now"
                continue

            act = ''

            # use our newer code, not framework
            ls = getiwlinks(text, flws).keys()

            # special case for pl here ...
            for lc in flws[home].nolink: 
                if lc not in ls: lcs.discard(lc)

            # wikt links to redirs
            if flws[home].redirs: lcs |= urs

            # list of iwikis in entry should match lcs, if not, we need to update
            if sorted(ls) == sorted(lcs) and not task.force:
                with plock: print "    ...", srep(title), "is okay"
                miadd(home, title) # ensure present in rc case (added with iwikis?)
                continue

            # if not always adding redirs to this wikt, but some present, is ok
            # also nolink wikts
            if (not flws[home].redirs or flws[home].nolink) and not task.force:
                ok = True
                # need to remove something
                for s in ls:
                    if s not in lcs and s not in urs and s not in flws[home].nolink: ok = False
                # need to add something
                for s in lcs:
                    if s not in ls: ok = False
                if ok:
                    with plock: print "    ...", srep(title), "is okay (may have redirects or nolinks)"
                    miadd(home, title)
                    continue

            # go hunt down some iwikis, add reciprocals when needed

            with plock: print "    ... hunting iwikis for", srep(title)
            sleep(huntq.qsize()*5) # q limit to reasonable?
            huntq.put(task)

        # loop on task ends

    # done

def hunter():

    while not Quit:
        task = huntq.get()

        # locals, and coerce types
        home = task.home
        title = task.title
        page = task.page

        links, redirs, complete = hunt(page)
        if Quit: break # return from hunt will not be valid

        # and update this page:
        addrci(page, flws[home].site, links = links, redirs = redirs, remove = complete)

        # record this title as done, links and redirs known
        if complete:
            ul = set(links.keys())
            ul.add(home)
            ur = set(redirs.keys())
            # sorted is nice, and makes lists again
            miset(title, sorted(ul - ur), sorted(ur))
        # else it will get done again at some point, hopefully without exceptions

    with plock: print "(hunter thread ending)"

# wiki-hunt ... see if a word is in other wikts, return list ...

def hunt(page):

    word = page.title()
    text = getwikitext(page, plock = plock) # will just return _contents
    home = page.site().lang

    ul, ur = miget(word)
    totry = set(ul) | set(ur)

    done = set()
    fps = set()
    links = { }
    redirs = { }

    # reiw = re.compile(r'\[\[([a-z-]{2,11}):' + re.escape(word) + r'\]\]')

    # simple scan for existing iwikis, use improved code

    # for lc in reiw.findall(text):
    iws = getiwlinks(text, flws)
    for lc in iws:
        lc = str(lc) # not unicode
        # if lc in site:
        totry.add(lc)

    # not home:
    totry.discard(home)
    done.add(home)

    exceptions = False

    while totry:
        lc = totry.pop()
        if flws[lc].lockedwikt or flws[lc].deletecode: continue

        if Quit: return None, None, False

        try:
            fpage = wikipedia.Page(site[lc], word)
            text = getwikitext(fpage, plock = plock)
            if isblank(text, fpage):
                # we don't want to link to entirely blank pages
                with plock: print "       ", srep(word), "in", lc, "is blank or empty"
                done.add(lc)
                continue # not adding to links
            with plock: print "       ", srep(word), "found in", lc
        except wikipedia.NoPage:
            with plock: print "       ", srep(word), "not in", lc
            done.add(lc)
            continue
        except wikipedia.IsRedirectPage:
            redirs[lc] = fpage
            with plock: print "       ", srep(word), "found in", lc, "(redirect)"
        except Exception, e:
            exceptions = True
            with plock: print "exception testing existence of word", str(e)
            done.add(lc)
            continue

        done.add(lc)
        links[lc] = fpage

        # add to list to add reciprocal link, or complete set, don't (can't :-) update redirects
        if lc not in redirs: fps.add(fpage)

        # look for iwikis in the page, add to to-be-tried if not already done

        iws = getiwlinks(text, flws)
        for lc in iws:
            lc = str(lc) # not in unicode
            if lc not in done and lc not in totry:
                with plock: print "            found further iwiki", lc
                totry.add(lc)

    # all done, now add reciprocals
    # don't remove anything if there were exceptions because hunt may be incomplete
    # if no exceptions, hunt is complete for these entries (there may be others not seen,
    # but then they aren't linked, as we've looked at all links ...), so remove any
    # links not found:

    for fpage in fps:
        if Quit: return None, None, False
        addrci(fpage, site[home], links=links, redirs=redirs, remove=not exceptions)

    # return list of all links and redirects, and flag if complete
    return links, redirs, not exceptions

# end? Finally?

if __name__ == "__main__":
    try:
        main()
    except KeyboardInterrupt:
        print "(keyboard interrupt)"
        # mostly just suppress traceback
    except Exception, e:
        print "exception", repr(e)
    finally:
        Quit = True
        replink(end = True)
        sleep(210) # give a bit of a chance for add tasks/hunt/rc to stop cleanly
        for i in range(0, 26):
            print "closing index file for", 'hijklmnopqrstuvwxyzabcdefg'[i]
            sleep(1) # time for print
            with milock: mifs[i].close()
        lru.close()
        wikipedia.stopme()