On the first day of the Azure Academy Hybrid IT track we will focus on compute, storage, networking, backup, DR and monitoring. Each section comes with a lot of content and a time limit. Try to finish as much as you can; whatever is left, take as homework.
As some labs take a long time to deploy, we will start the deployments now and check the results later.
Deploy the Azure Disk storage test. There are 12 tests to be run, so select a reasonable per-test time to finish. Also note that the VM extension limits script running time to 90 minutes, so if you need long test times (such as 60 minutes per test), choose 1 second and rerun the script manually (sudo /var/lib/waagent/custom-script/download/0/test.sh 600). For our lab select 120 seconds.
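If you prefer the CLI to the portal Deploy button, a deployment along these lines should work (the location, template file name and test-length parameter name are placeholders, not the lab's actual values):
# Placeholder template file and parameter name - use the lab's actual template and parameter names
az group create --name storagetest-rg --location westeurope
az deployment group create --resource-group storagetest-rg --template-file storagetest.json --parameters testLengthSeconds=120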
Deploy the networking lab environment (keep the prefix at its default or use your own, but only with lowercase letters).
Use the attached scripts.
Note: It takes some time for Backup to be ready. Follow steps 1-3 and then we will continue with the next section. After about 1 hour we will come back and continue from step 4.
- Create Windows VM
- Enable backup and create Backup Vault
- Initiate backup by clicking Backup Now (it will create an app-consistent snapshot by orchestrating with VSS in about 20 minutes; we then have to wait about 1 hour until the backup is transferred from the snapshot to the vault)
- Go to the existing backup and click File Recovery, then map the backup as an iSCSI disk to your notebook following the instructions
- Unmount backup
- Restore the whole VM. You can either create a new VM or replace the existing one (when replacing, the existing VM needs to be stopped first). A CLI sketch of the backup steps follows this list.
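If you prefer to script the vault and backup steps, a rough Azure CLI equivalent looks like this (resource group, vault and VM names are illustrative; the portal flow above is the intended path for this lab):
# Illustrative names - adjust resource group, vault and VM names to your environment
az backup vault create --name labbackupvault --resource-group backup-rg --location westeurope
az backup protection enable-for-vm --vault-name labbackupvault --resource-group backup-rg --vm myWindowsVM --policy-name DefaultPolicy
# Trigger an on-demand backup; for IaaS VMs the container and item name are usually the VM name
az backup protection backup-now --vault-name labbackupvault --resource-group backup-rg --container-name myWindowsVM --item-name myWindowsVM --retain-until 01-01-2030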
Connect to the storagetest VM you deployed previously and check the results of the storage test. The disk types are:
- Standard HDD S20 /disk/standardhdd (sdc)
- Standard SSD E20 /disk/standardssd (sdd)
- Premium SSD P20 /disk/premiumssd (sde)
- 4x Standard SSD E10 in LVM pool /disk/vg/lv (sdf, sdg, sdh, sdi)
- Local SSD /mnt (sdb)
- 2x small Standard HDD with different cache settings /disk/uncached and /disk/cached
- (UltraSSD) - this is in private preview and requires enrollment, so there is a separate template and test script for those who have been whitelisted for the preview
Three tests are run:
- sync random write test (waiting for ACK after each transaction, simulating a legacy workload)
- async random write test (256 bulk operations with 4 threads, simulating a database workload)
- async random read test comparing cached and non-cached performance
Connect to VM and check results:
export ip=$(az network public-ip show -n storagetest-ip -g storagetest-rg --query ipAddress -o tsv)
ssh storage@$ip
sudo -i
ls
How to read results:
- In /root you will find a couple of *.results files with output from the FIO tool
- Check IOPS (with multiple writers, sum the IOPS of each writer) - line write: IOPS=
- Check latency on the sync tests, especially clat percentiles (focus on the 50th and 99th); a grep shortcut for both follows this list
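A quick way to pull both numbers out of all result files at once (a sketch assuming the usual fio output format):
cd /root
# per-writer IOPS lines; with multiple writers sum them for the total
grep "IOPS=" *.results
# completion-latency percentiles for the sync tests
grep -A8 "clat percentiles" *.results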
What to expect:
- For sync legacy-type access, latency is the most important factor for IOPS. Multiple disks do not help
- Latency fluctuates on Standard HDD, much less on Standard SSD, and is very stable on Premium SSD and Local SSD
- LVM IOPS with Standard SSDs is close to a Premium SSD of the same size, but there is operational overhead to set up LVM and latency is more consistent with the Premium SSD
- You can easily hit the disk IOPS limit and achieve little more than specified
- With Local SSD you are reaching the limit of your VM size, not a storage limit (if you need extreme local non-redundant performance check L-series VMs)
- Always be aware of VM size limits; it makes no sense to buy a P70 disk when it is connected to a D2s_v3 VM type: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes
- UltraSSD performance is very high - check latency at the 50th and 99th percentile; provisioned IOPS is set to 14k to demonstrate that you are limited by the 12800 IOPS of the D8s VM (it actually gets slightly higher). You can change provisioned IOPS for UltraSSD without turning the VM off (a CLI sketch follows this list)!
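Changing the provisioned IOPS of an UltraSSD can be done in the portal or with the CLI while the VM keeps running. A minimal sketch, assuming a disk named ultradisk1 in the storagetest-rg resource group (adjust both names to your deployment):
# Disk name is illustrative - raise provisioned IOPS and throughput without stopping the VM
az disk update --name ultradisk1 --resource-group storagetest-rg --disk-iops-read-write 20000 --disk-mbps-read-write 200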
Try to run some of the tests yourself (an illustrative job-file sketch follows the commands):
sudo -i
cd /root
fio --runtime 60 s-hdd-sync.ini
fio --runtime 60 p-ssd-sync.ini
fio --runtime 60 s-hdd-async.ini
fio --runtime 60 p-ssd-async.ini
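If you are curious what such a job definition contains, here is an illustrative sketch of a sync random-write job; the .ini files already present in /root are the authoritative ones, and the file name, target path and values below are examples only:
# Illustrative fio job file - synchronous random writes against the Standard HDD mount
cat <<'EOF' > example-sync.ini
[sync-randwrite]
filename=/disk/standardhdd/testfile
rw=randwrite
bs=8k
ioengine=sync
iodepth=1
direct=1
size=1G
time_based
runtime=60
EOF
fio example-sync.ini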
- Create storage account v2 with RA-GRS replication (a CLI sketch of the first steps follows this list)
- Create container and upload some files
- Use Storage Explorer to generate SAS token to access file
- Make sure file is also readable on secondary endpoint (secondary region)
- Move one file to Cool tier
- Move one file to Archive tier
- Create policy to automatically manage lifecycle of files:
  - Move file to Cool after 1 day of inactivity
  - Move file to Archive after 2 days of inactivity
  - Delete file after 3 days of inactivity
- Set up an immutable policy so files in a particular container cannot be removed for 2 days
- Enable soft delete and observe behavior
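Most of these steps can also be scripted. A sketch with the Azure CLI, using illustrative account, container and file names (authentication via az login or the account key is assumed; the lifecycle, immutability and soft-delete settings are easier to configure in the portal and are left out here):
# Illustrative names - the storage account name must be globally unique
az storage account create --name mystorageacct123 --resource-group storage-rg --location westeurope --kind StorageV2 --sku Standard_RAGRS
az storage container create --name data --account-name mystorageacct123
az storage blob upload --account-name mystorageacct123 --container-name data --name sample.txt --file ./sample.txt
# Move individual blobs between access tiers
az storage blob set-tier --account-name mystorageacct123 --container-name data --name sample.txt --tier Cool
az storage blob set-tier --account-name mystorageacct123 --container-name data --name sample.txt --tier Archive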
- Large-scale file copy with AzCopy v10
- Install AzCopy v10: https://docs.microsoft.com/cs-cz/azure/storage/common/storage-use-azcopy-v10
- Use AzCopy to copy a file from local disk to blob storage (see the sketch right after this item)
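A minimal AzCopy v10 example for the local-to-blob copy, assuming you reuse the SAS token generated earlier (account, container, file name and token are placeholders):
# Placeholders - replace the account, container, file name and SAS token with your own values
azcopy copy "./sample.txt" "https://mystorageacct123.blob.core.windows.net/data/sample.txt?<SAS-token>"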
- Large-scale automated file copy with Data Box Gateway
- Download Azure Data Box Gateway and install it in your hypervisor (use your PC or nested virtualization in Azure using Dv3 or Ev3 machine instance)
- Connect Data Box to Azure and associate with storage account
- Setup copy policies and rate limit
- Via Azure portal create SMB share on your Data Box
- Copy files locally to Data Box and observe files being synchronized to storage account in Azure
- Open the storage account v2 created in the previous step (a CLI sketch for the share and snapshot steps follows this list)
- Create file share
- Map share to your local PC (Windows 10) via SMB3 or to VM in Azure (Windows or Linux)
- Create text file and copy it to share
- In Azure create snapshot of your share
- Modify the text file locally and make sure the change propagates to Azure
- In your local Explorer right-click the file and go to Previous Versions so you are able to restore a previous version of the file
- Set up Azure Backup to orchestrate snapshotting and backup of your Azure Files
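The share and snapshot steps can also be done from the CLI. A sketch with an illustrative share name, assuming the storage account from the previous section:
# Illustrative share name; assumes you are authenticated against the storage account
az storage share create --name labshare --account-name mystorageacct123
az storage share snapshot --name labshare --account-name mystorageacct123
# On Windows 10 the share can be mapped over SMB3 roughly like this (account name and key are placeholders):
# net use Z: \\mystorageacct123.file.core.windows.net\labshare /user:AZURE\mystorageacct123 <storage-account-key>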
Follow the network diagram and instructions in the following repo to test a couple of networking scenarios: https://github.com/tkubica12/azure-networking-lab
Deploy the following template (e.g. to a netperf resource group) that will create 2x VM in zone 1 + 1x VM in zone 2 + 1x VM in a different region and install the throughput testing tool iperf and the latency testing tool qperf.
Connect to z1-vm1
export z1vm1=$(az network public-ip show -n z1-vm1-ip -g netperf --query ipAddress -o tsv)
ssh net@$z1vm1
Test bandwidth to vm2 in the same zone, vm3 in a different zone and the VM in a different region.
sudo iperf -c z1-vm2 -i 1 -t 30
sudo iperf -c z2-vm3 -i 1 -t 30
sudo iperf -c 10.1.0.4 -i 1 -t 30 # secondary region via VNET peering
sudo iperf -c 10.1.0.4 -P16 -t 30 # multiple parallel sessions
Test latency to vm2 in the same zone, vm3 in a different zone and the VM in a different region.
qperf -t 20 -v z1-vm2 tcp_lat
qperf -t 20 -v z2-vm3 tcp_lat
qperf -t 20 -v 10.1.0.4 tcp_lat
What to expect and what to do:
- Network throughput will be close to the performance specified in the documentation (8 Gbps for Standard_D16s_v3)
- Network throughput is pretty much the same within a zone and across zones, but across zones it might fluctuate a bit
- Throughput between regions goes over a WAN link, so it is expected to be slower. Typically a single connection is still very good, though slower than inside a region, while multiple parallel sessions get close to in-region performance (but might fluctuate a bit more)
- Thanks to accelerated networking, available on some machines including Standard_D16_v3, latency is very good
- Latency inside zone is usually better than latency across zones, but still very good (suitable for sync operations)
- Latency between regions is good for WAN link, but obviously suited more for async operations
- Do not use ping to test latency - ICMP has very low priority in the Azure network as well as in the OS TCP/IP stack
- Do not use file copy to test network throughput, as storage can be the limiting factor
We will work on a scenario where the IP address is retained during failover; check this link.
Use the attached scripts.
Steps
- Create VNets with peerings
- Create a primary Windows AD server with a DNS zone
- Set up DNS on the VNets (see the CLI sketch after this list) and join the web server to Active Directory
- Create the app web server and configure IIS
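For the DNS step, the custom DNS server (the AD server's private IP) is set on the VNets. A CLI sketch with illustrative VNet, resource group and IP values (take the real ones from the attached scripts):
# Illustrative names and IP - point both VNets to the AD/DNS server's private IP
az network vnet update --name hub-vnet --resource-group asr-rg --dns-servers 10.0.0.4
az network vnet update --name web-vnet --resource-group asr-rg --dns-servers 10.0.0.4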
Follow these steps: https://docs.microsoft.com/en-us/azure/site-recovery/azure-to-azure-tutorial-enable-replication
Try to access website after failover.
Challenges
- DNS routing
- complete scripts for deployment with infrastructure automation - DSC etc.
- complete scripts for failover - load balancer etc.
Onboard the VM to Azure Monitor and check the VM health page, performance metrics and Service Map.
Open Logs and search for logs, using filtering and the basic capabilities of the Kusto language. Follow the instructor to search and filter Event logs.
Follow the instructor for step-by-step creation of queries. You will learn about key tables such as Event, Syslog, Perf or Update, and learn how to filter, summarize and use joins.
One of the resulting queries will combine information from the Event, Update and Perf tables:
Event
| where EventLevelName in ("Warning", "Error")
| summarize EventCount = count() by Computer, EventLevelName
| join kind=leftouter (
    Update
    | where Classification == "Critical Updates"
    | where UpdateState == "Needed"
    | summarize UpdatesMissing = count() by Computer
) on Computer
| project-away Computer1
| join kind=leftouter (
    Perf
    | where CounterName == "% Processor Time"
    | where ObjectName == "Processor Information"
    | where InstanceName == "_Total"
    | summarize CpuLoad99Percentile = strcat(round(percentile(CounterValue, 99), 1), ' %') by Computer
) on Computer
| project-away Computer1
You will also use bin aggregation by time to generate a time chart:
Perf
| where TimeGenerated >= ago(1h)
| where ObjectName == "Processor"
| where CounterName == "% Processor Time"
| where InstanceName == "_Total"
| summarize CpuLoad = round(percentile(CounterValue, 95), 1) by bin(TimeGenerated, 5m), Computer
| render timechart
Let's search for underutilized servers so we can potentially downsize and save costs.
Try this query. We will calculate CPU load at a percentile (average is not a good metric; you typically want to see the 90th, 95th or 99th percentile, or the 100th percentile which is the maximum). Because we want to easily change the percentile, make it a variable (let statement).
let Percentile=90;
Perf
| where CounterName == "% Processor Time"
| where ObjectName == "Processor"
| where InstanceName == "_Total"
| summarize cpuLoad=percentile(CounterValue, Percentile) by Computer
| sort by cpuLoad asc
Let's try another example - we will print the load of individual processes on each computer.
let Percentile=90;
Perf
| where CounterName == "% Processor Time"
| where ObjectName == "Process"
| summarize processCpuLoad=percentile(CounterValue, Percentile) by _ResourceId, InstanceName
| sort by processCpuLoad desc
We will now create a Workbook. Create a new Workbook in Azure Monitor -> Workbooks -> Empty. First add a text block with something like Underutilization report. Then use Add query with the first query, select Grid as the visualization, Logs as the data source, and pick your Log Analytics workspace.
Next we will want to make this view look better. First replace Computer with _ResourceId so the Azure VM is clickable.
let Percentile=90;
Perf
| where CounterName == "% Processor Time"
| where ObjectName == "Processor"
| where InstanceName == "_Total"
| summarize cpuLoad=percentile(CounterValue, Percentile) by _ResourceId
| sort by cpuLoad asc
We will now graphically signal resources that are underutilized and overutilized. Click Column Settings and use the Thresholds renderer with Icons for the cpuLoad column. Add the following icons:
- <= 15 to blue information icon and text {0}{1} (Underutilized)
- <= 90 to green available icon and text {0}{1} (OK)
- Default to red error icon and text {0}{1} (overutilized)
Also select Custom number formatting and set Units to Percentage, Style to Decimal and Maximum fractional digits to 2.
Click Apply and Save and Close.
We want the user to be able to select the percentile. Add Parameters, name it selectedPercentile, display name Percentile, type Drop down, check Required and use the following JSON to define the options:
[
{ "value": 50, "label": "50th"},
{ "value": 75, "label": "75th"},
{ "value": 90, "label": "90th"},
{ "value": 95, "label": "95th"},
{ "value": 99, "label": "99th"},
{ "value": 100, "label": "100th"}
]
Modify the query to react to the selected percentile:
let Percentile={selectedPercentile};
Perf
| where CounterName == "% Processor Time"
| where ObjectName == "Processor"
| where InstanceName == "_Total"
| summarize cpuLoad = percentile(CounterValue, Percentile) by _ResourceId
| sort by cpuLoad asc
We will now make the workbook interactive. When clicking on a particular row we want to display per-process details for that computer in another grid below. Click Advanced Settings -> When an item is selected, export a parameter, and export the parameter _ResourceId as selectedResourceId with a default value of NA. Also configure the chart title and enable client-side search.
Add a new query below.
let Percentile={selectedPercentile};
Perf
| where _ResourceId == "{selectedResourceId}"
| where CounterName == "% Processor Time"
| where ObjectName == "Process"
| summarize processCpuLoad=percentile(CounterValue, Percentile) by _ResourceId, InstanceName
| project processName=InstanceName, processCpuLoad
| sort by processCpuLoad desc
To make it look better, use Column Settings to render processCpuLoad as a Bar with blue color and use Custom number formatting to limit the number of fractional digits.
Click Advanced Settings and configure the following:
- Make this half screen by setting Make this item a custom width to 50
- Modify title to Server processes
- Enable client-side search with Show filter field above grid or tiles
- Hide this chart if no server is selected by clicking Make this item conditionally visible when selectedResourceId is not equal to NA
We now have an interactive Workbook. Let's configure the space on the right side to include two time charts: one showing CPU load over time for the selected server and one showing CPU load over time for the selected process.
Add query with Logs as data source for your Log Analytics workspace that will look like this:
let Percentile={selectedPercentile};
Perf
| where _ResourceId == "{selectedResourceId}"
| where CounterName == "% Processor Time"
| where ObjectName == "Processor"
| where InstanceName == "_Total"
| summarize cpuLoad = percentile(CounterValue, Percentile) by _ResourceId, bin(TimeGenerated, 15m)
Set visualization to Time chart and Legend to Maximum value. Go to Advanced Settings and configure:
- Set Make this item a custom width to 50
- Hide this chart if no server is selected by clicking Make this item conditionally visible when selectedResourceId is not equal to NA
- Modify chart title to CPU load
Unguided task
You now have all the information to solve the following task. Add a time chart to show CPU load over time for the selected process. Use the full screen width but a small height.
The resulting workbook should look like this:
Note that in today's lab our servers might run for just a few hours. For practical use of this report, gather more data and modify all queries to use a Time range of Last 30 days.
Instead of a fixed time range (such as Last 30 days), make it a parameter on top (choose from last 7, 30 and 90 days) and make it part of your query.
Create a workbook to show Critical logs from all computers. When a computer is selected, show all of its logs.
Open the Metrics page, create your own views and pin them to a dashboard. Install the guest-level agent to gather additional metrics such as file-system or memory usage.
Create an alert based on metrics. We will use a dynamic threshold (ML-based alert) on CPU usage and set up an Action Group to send a push notification to the Azure mobile application. Generate CPU load on the machine and wait for the alert to pop up in the Azure mobile application.
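The same alert can be sketched with the CLI; the resource group, VM resource ID and action group below are placeholders, and the dynamic threshold is expressed with the dynamic keyword in the condition:
# Placeholders - adjust resource group, VM resource ID and action group name
az monitor metrics alert create --name cpu-dynamic-alert --resource-group monitoring-rg \
  --scopes /subscriptions/<subscription-id>/resourceGroups/monitoring-rg/providers/Microsoft.Compute/virtualMachines/<vm-name> \
  --condition "avg Percentage CPU > dynamic medium 2 of 4" \
  --action push-to-mobile
# One simple way to generate CPU load on a Linux VM: yes > /dev/null &   (stop it with kill %1)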
Create an Automation Account and onboard the VM to it. Check missing updates and learn how to plan patching deployments.
Investigate inventory tracking for a list of installed components and applications, as well as change tracking to see the history.
Import a PowerShell DSC configuration to install IIS, onboard the VM to Azure Automation Configuration Management and check that IIS is getting installed.
Prerequisites: Azure Subscription with 100 cores quota
Training:
- Compute
- Storage
- Networking
Dates: Prague 15.5.19, Bratislava 4.6.
Prerequisites:
- Compute lab completed
- Complete Azure Backup section steps 1-3 (Backup section on VM in portal)
- Turn on the compute lab a few days before the training to collect data
- If you want more data collected, onboard the monitoring solutions below a few days before the training (in the corresponding sections of the VM configuration in the portal) and use the single Log Analytics workspace and Automation Account you will create in the wizard:
  - Update Management
  - Inventory
  - Change Tracking
  - Insights
- Enable guest-level monitoring in Diagnostic settings
Notebook with:
- Azure CLI
- Visual Studio Code
- Azure Storage Explorer
Training:
- Backup
- ASR
- Monitoring
- VM operations
Dates: Prague 28.6.19
Follow these instructions.