
Ashoat

Chief Financial Officer
  • Posts

    6,455
  • Days Won

    37

Posts posted by Ashoat

  1. Sorry for the slowness. Johnny required a manual fsck, as there was some disk corruption resulting from the power failure. I just finished resolving those issues, and Johnny should be up soon.

  2. HE's report:

    At approximately 6:00am PDT, at the Fremont 1 facility, power was interrupted to the datacenter floor when the main UPS input breaker tripped. The breaker was reset, the maintenance bypass closed, and power was restored at approximately 6:20am PDT. Support technicians on site visually inspected customer equipment located in the affected area, restarting any that needed assistance.

     

    UPS technicians will be onsite today to assist us in determining the cause of the failure.

     

    Moving forward, we are now going to replace this UPS system with another brand as soon as possible and we are going to go from an N+1 UPS configuration to a 1+1 electrical system, with each PDU having a static transfer switch to a separate electrical distribution system so that if something happens all the PDUs can be switched over. The time line for this is in the next few months due to the lead time on getting this type of equipment.

     

    The power failure actually only lasted about twenty minutes, but since most of our servers are running ext2 (to avoid wasting too much CPU time on kjournald), we had to run a long fsck on them. Not sure what's going on with Johnny right now. I can't SSH in, so I'll have to log in via VMware vSphere when I get home.

  3. Geoff: are you sure you're phrasing that correctly? A nameserver is authoritative for a domain if that domain (or its earliest parent domain) has an NS record for that nameserver. In this sense, I would imagine that ns2.heliohost.org is authoritative for all HelioHost domains.

     

    Are you saying that Cody doesn't have the records for a lot of *.heliohost.org domains? That might be the case. I'm running the following command on Cody right now to sync all the records:

    /scripts/dnscluster syncall --full

     

    If this happens again in the future, you could try running that command and seeing if it helps. Make sure to run it in screen, though, as it takes a while.
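
    A minimal sketch of the "run it in screen" advice, assuming GNU screen is installed. The helper name run_in_screen and the session name dnssync are made up for illustration; the sync command is the one quoted above. This starts the session already detached, which is a variation on typing the command inside an interactive screen session.

    ```shell
    # run_in_screen NAME COMMAND...: start COMMAND in a detached screen session
    # called NAME, so it keeps running if the SSH connection drops.
    # (run_in_screen and NAME are illustrative, not part of cPanel.)
    run_in_screen() {
        local name=$1
        shift
        # -d -m: start detached; -S: give the session a name to reattach to.
        screen -dmS "$name" "$@"
    }
    ```

    Usage would look like: run_in_screen dnssync /scripts/dnscluster syncall --full, then screen -ls to see whether the session still exists and screen -r dnssync to attach. The session goes away once the command exits.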

     

    UPDATE

    Syncing is done. "rndc reload" is failing for some unknown reason (possibly related to the issue at hand), so I'm running /scripts/restartsrv_named.

     

    UPDATE

    Looks like named.conf was broken on Cody. Consequently, ns2.heliohost.org is currently down. Running a script (mv /etc/named.conf /etc/named.conf.broken; /scripts/rebuildnamedconf) on Cody to rebuild named.conf...
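
    A hedged sketch of the same repair with a check in front of it. named-checkconf ships with BIND and exits non-zero when the file fails to parse; whether it would have caught this particular breakage is an assumption. The function name and the two overridable variables (CHECK_CMD, REBUILD_CMD) are illustrative, and the rebuild script is run exactly as in the commands quoted above.

    ```shell
    # Both commands can be overridden, which makes a dry run possible.
    CHECK_CMD=${CHECK_CMD:-named-checkconf}
    REBUILD_CMD=${REBUILD_CMD:-/scripts/rebuildnamedconf}

    # repair_named_conf [PATH]: leave a config that parses cleanly alone;
    # otherwise move it aside as PATH.broken and run the rebuild script.
    repair_named_conf() {
        local conf=${1:-/etc/named.conf}
        if "$CHECK_CMD" "$conf"; then
            echo "$conf parses cleanly"
            return 0
        fi
        mv "$conf" "$conf.broken" && "$REBUILD_CMD"
    }
    ```

    Keeping the old file as named.conf.broken, as in the original commands, leaves something to diff against afterwards.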

     

    UPDATE

    Okay, named.conf rebuilt and named restarted on Cody. Both nameservers seem to be correctly responding to DNS requests now.

  4. It seems that ns1.heliohost.org (Stevie) has the right results for this MX record, but ns2.heliohost.org (Cody) does not. However, the actual zone files seem to be intact on Cody. I'm syncing zone files right now, but once that's done I'll restart named on Cody in the hopes that that will correct this issue.
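
    One way to spot this kind of disagreement is to ask each nameserver directly and compare the answers. The sketch below assumes dig is available; the function names are made up for illustration, and example.heliohost.org in the usage note is a placeholder rather than a real account.

    ```shell
    # query SERVER NAME TYPE: ask one nameserver directly (no resolver cache)
    # and sort the answer so record order does not matter.
    query() {
        dig +short "@$1" "$2" "$3" | sort
    }

    # in_sync NAME TYPE: succeed when ns1 and ns2 return the same record set.
    # Caveat: two empty answers also count as "the same".
    in_sync() {
        local a b
        a=$(query ns1.heliohost.org "$1" "$2")
        b=$(query ns2.heliohost.org "$1" "$2")
        [ "$a" = "$b" ]
    }
    ```

    For example: in_sync example.heliohost.org MX || echo "ns1 and ns2 disagree".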

     

    Please continue discussion of this issue in this thread. Closing...

  5. I'm generally on IRC, but I won't notice your message unless you type my username in your message or if you /msg me. My username is "ashoat" (my first name).

     

    We definitely can't upgrade our install of Perl. I tried this once and ran into a lot of difficulties, since cPanel is mostly written in Perl. If there were a way to install a newer version side-by-side then I'd be up for that, but I wasn't able to do it last time (if I recall correctly).

  6. I'm guessing that cPanel EasyApache just disables those extensions by default, even though PHP has them on by default. We'll enable them next time we rebuild.

     

    List of PHP modules we are currently going to add when we rebuild:

    fileinfo

    intl

     

    As for Roundcube, it has a process time leak somewhere. I disabled it (the most recent disabling) after I found it using far too much CPU time on one of the servers. This issue might have since been patched, though... I'm not sure.

  7. Geoff: no, that script is definitely not a good idea. If we really wanted to fill up resources on the queue then we could do so by changing bihourly.php settings to make a single process stay permanently open. That way, we won't have two account creations happening simultaneously, which usually is really bad for the HDD.
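
    The bihourly.php settings themselves aren't shown here, so the following is a different, generic technique with the same goal of never running two account creations at once: a non-blocking file lock via flock from util-linux. The function name, lock path, and command in the usage note are placeholders, not HelioHost's actual setup.

    ```shell
    # run_exclusive LOCKFILE COMMAND...: run COMMAND only if nobody else holds
    # LOCKFILE; otherwise give up immediately with a non-zero status.
    run_exclusive() {
        local lock=$1
        shift
        # -n: do not wait for the lock, fail right away if it is taken.
        flock -n "$lock" "$@"
    }
    ```

    For example: run_exclusive /tmp/account_creation.lock some_command. A second invocation started while the first is still running exits without doing anything.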

  8. I usually do my killing through htop. I open it up, set it to tree mode (F5), find the bihourly tree (/bihourly), select everything in the tree (go up and down with the arrow keys and select with the spacebar) and then kill everything selected (F9, enter).

     

    You could alternatively use killall. killall bihourly_wrapper should do the trick; let me know if it doesn't.
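
    If killall matches nothing, a rough alternative is to match on the full command line with pgrep (from procps) and kill the resulting PIDs. This is only a sketch: the function names are made up, and the pattern to use depends on how the bihourly processes actually appear in the process list.

    ```shell
    # list_targets PATTERN: print PIDs whose full command line matches PATTERN,
    # leaving out the shell that is running this function.
    list_targets() {
        pgrep -f "$1" | grep -vx "$$"
    }

    # kill_targets PATTERN: send SIGTERM to every listed PID; fail if none match.
    kill_targets() {
        local pids
        pids=$(list_targets "$1")
        [ -n "$pids" ] || return 1
        # Unquoted on purpose: word splitting passes one argument per PID.
        kill $pids
    }
    ```

    Running list_targets bihourly first shows what would be hit before kill_targets bihourly sends any signals.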
