Senior Big Data Engineer Resume
Boise, ID
SUMMARY
- Over 8 years of diversified experience in software design and development, including work as a Big Data Engineer solving business use cases for several clients, with expertise in backend applications.
- Solid experience developing Spark applications for highly scalable data transformations using RDDs, DataFrames, Spark SQL, and Spark Streaming.
- Hands-on experience with Kafka and Flume for loading log data from multiple sources directly into HDFS.
- Experience using AWS cloud services such as EMR, S3, EC2, Redshift, and Athena.
- Strong expertise in building scalable applications using various programming languages (Java, Scala, and Python).
- Proficient in Core Java concepts such as multi-threading, collections, and exception handling.
- Experience developing applications with Model View Controller (MVC2) architecture using the Spring Framework and J2EE design patterns.
- Strong experience troubleshooting Spark failures and fine-tuning long running Spark applications.
- Strong experience working with various Spark configurations such as broadcast thresholds, increased shuffle partitions, caching, and repartitioning to improve job performance (see the configuration sketch at the end of this summary).
- Worked on Spark Streaming and Structured Streaming with Kafka for real-time data processing.
- Strong experience operating in Amazon Web Services (AWS) cloud environments such as EC2 and S3.
- Continuous Delivery pipeline deployment experience with Maven, Ant, Jenkins, and AWS.
- Strong understanding of distributed systems design, HDFS architecture, and the internal workings of the MapReduce and Spark processing frameworks.
- Experience in MVC and microservices architecture with Spring Boot, Docker, and Docker Swarm.
- Experience migrating SQL databases to Azure Data Lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
- Solid experience using various file formats such as CSV, TSV, Parquet, ORC, JSON, and Avro.
- Experienced working with various Hadoop Distributions (Cloudera, Hortonworks, Amazon EMR) to fully implement and leverage various Hadoop services.
- Well versed in writing complex Hive queries using analytical functions.
- Knowledge of writing custom UDFs in Hive to support custom business requirements.
- Experienced in working with structured data using HiveQL, join operations, writing custom UDFs and optimizing Hive queries.
- Expertise in using Docker and setting up ELK with Docker and Docker-Compose. Actively involved in deployments on Docker using Kubernetes.
- Configured Spark Streaming to receive real-time data from Kafka, store the stream data to HDFS, and process it using Spark and Scala.
- Strong experience working with databases such as Oracle, MySQL, Teradata, and Netezza, and proficiency in writing complex SQL queries.
- Experience in version control tools like SVN, GitHub and CVS.
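To illustrate the Spark tuning levers referenced above, here is a minimal, hedged PySpark sketch; the table names, columns, and S3 paths are hypothetical placeholders rather than details from any engagement.

```python
# Minimal PySpark sketch of the tuning levers above; names and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("tuning-sketch")
    # Raise the broadcast-join threshold so small dimension tables are broadcast (~50 MB).
    .config("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)
    # Increase shuffle partitions for wide aggregations over large inputs.
    .config("spark.sql.shuffle.partitions", 400)
    .getOrCreate()
)

sales = spark.read.parquet("s3://example-bucket/sales/")          # hypothetical path
dim_store = spark.read.parquet("s3://example-bucket/dim_store/")  # hypothetical path

# Cache a DataFrame that several downstream actions reuse.
sales = sales.cache()

# Repartition on the join key to reduce skew, then broadcast the small side.
joined = (
    sales.repartition(400, "store_id")
    .join(F.broadcast(dim_store), "store_id")
)

joined.groupBy("store_id").agg(F.sum("amount").alias("total_amount")).show()
```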
TECHNICAL SKILLS
Operating Systems: Unix, Linux, Windows
Programming Languages: Java, Python 3, Scala 2.12.8, PySpark, C, C++
Hadoop Eco System: Hadoop, MapReduce, Spark, HDFS, Sqoop, YARN, Oozie, Hive, Impala, Apache Flume, Apache Storm, Apache Airflow, HBase
Cluster Management & Monitoring: CDH, Hortonworks Ambari
Data Bases: MySQL, SQL Server, Oracle 12c, MS Access
NoSQL Data Bases: MongoDB, Cassandra, HBase, KairosDB
Workflow Management Tools: Oozie, Apache Airflow
Visualization & ETL tools: Tableau, BananaUI, D3.js, Informatica, Talend
Cloud Technologies: Azure, AWS
IDEs: Eclipse, Jupyter Notebook, Spyder, PyCharm, IntelliJ
Version Control Systems: Git, SVN
PROFESSIONAL EXPERIENCE
Confidential, Boise, ID
Senior Big Data Engineer
Responsibilities:
- Work in a fast-paced agile development environment to quickly analyze, develop, and test potential use cases for the business.
- Developed data pipelines using Spark, Hive, Pig, Python, Impala, and HBase to ingest customer data.
- Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
- Designed SSIS packages to extract, transform, and load (ETL) existing data into SQL Server from different environments for the SSAS (OLAP) cubes.
- Created a data model that correlates all the metrics and produces valuable output.
- Worked on the tuning of SQL Queries to bring down run time by working on Indexes and Execution Plan.
- Worked on ETL migration services by developing and deploying AWS Lambda functions for a serverless data pipeline that writes to the Glue Catalog and can be queried from Athena.
- Developed a detailed project plan and helped manage the data conversion migration from the legacy system to the target Snowflake database.
- Created data pipelines for business reports and processed streaming data using an on-premises Kafka cluster.
- Processed data from Kafka topics and surfaced the real-time streams in dashboards.
- Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
- Developed Spark programs in Python and applied functional programming principles to process complex structured data sets.
- Developed Python-based API (RESTful Web Service) to track revenue and perform revenue analysis.
- Explored Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
- Involved in integrating the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
- Performed data preprocessing and feature engineering for further predictive analytics using Python Pandas.
- Created Spark code to process streaming data from the Kafka cluster and load it into a staging area for processing (see the streaming sketch at the end of this section).
- Migrated data from on-premises systems to AWS storage buckets.
- Developed a Python script to call REST APIs and extract data to AWS S3.
- Created Tables, Stored Procedures, and extracted data using T-SQL for business users whenever required.
- Used PySpark and Pandas to calculate the moving average and RSI score of stocks and loaded the results into the data warehouse.
- Designed, developed, and tested dimensional data models using Star and Snowflake schema methodologies under the Kimball method.
- Designed AWS CloudFormation templates to create VPCs, subnets, and NAT gateways to ensure successful deployment of web applications and database templates.
- Created a Lambda deployment function and configured it to receive events from S3 buckets.
- Selected and generated data into CSV files, stored them in AWS S3 using EC2, and then structured and loaded the data into AWS Redshift.
- Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
- Developed a PySpark script to encrypt raw data using hashing algorithms.
- Created and formatted Cross-Tab, Conditional, Drill-down, Top N, Summary, Form, OLAP, sub-reports, ad-hoc, parameterized, interactive, and custom reports using SQL Server Reporting Services (SSRS).
- Created action filters, parameters, and calculated sets for preparing dashboards and worksheets using Power BI.
Environment: Spark, Python, Scala, Kafka, AWS, EC2, Redshift, S3 Buckets, ETL, Tableau, Presto, Hive/Hadoop, Snowflake, AWS Data Pipeline, IBM Cognos 10.1, Cognos Report Studio 10.1, Cognos Connection, Cognos Office Connection, DataStage and QualityStage, Oracle, SQL Server, Shell Scripting, Git
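A hedged sketch of the Kafka-to-staging-area streaming ingest described in this section; the broker address, topic name, event schema, and S3 paths are illustrative assumptions, and the job assumes the spark-sql-kafka connector package is on the classpath.

```python
# Illustrative PySpark Structured Streaming job: Kafka topic -> Parquet staging area.
# Broker, topic, schema, and S3 paths below are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("kafka-staging-ingest").getOrCreate()

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("event_ts", TimestampType()),
])

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
    .option("subscribe", "customer-events")              # placeholder topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers the payload as bytes in the `value` column; parse it as JSON.
events = (
    raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

# Land the parsed stream as Parquet in a staging area for downstream batch jobs.
query = (
    events.writeStream.format("parquet")
    .option("path", "s3://example-bucket/staging/customer-events/")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/customer-events/")
    .outputMode("append")
    .start()
)

query.awaitTermination()
```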
Confidential, Rochester, MN
Big Data Engineer
Responsibilities:
- Installed Kafka producers on different servers and scheduled them to produce data every 10 seconds.
- Implemented data quality checks in the ETL tool Talend; good knowledge of data warehousing.
- Developed Apache Spark applications to process data from various streaming sources.
- Good exposure to MapReduce programming using Java, Pig Latin scripting, distributed applications, and HDFS.
- Used Airflow to monitor and schedule workflows.
- Configured Spark Streaming to consume ongoing data from Kafka and store the stream data to HDFS.
- Worked with PowerShell and UNIX scripts for file transfer, emailing and other file related tasks.
- Created Spark vectorized pandas user-defined functions for data manipulation and wrangling (see the pandas UDF sketch at the end of this section).
- Involved in creating an HDInsight cluster in the Microsoft Azure Portal and created Event Hubs and Azure SQL databases.
- Took proof-of-concept project ideas from the business, then led, developed, and created production pipelines that deliver business value using Azure Data Factory.
- Strong knowledge of the architecture and components of Tealeaf; efficient in working with Spark Core and Spark SQL. Designed and developed RDD seeds using Scala and Cascading, and streamed data to Spark Streaming using Kafka.
- Wrote PySpark and Spark SQL transformations in Azure Databricks to perform complex transformations for business rule implementation.
- Developed Hive UDFs to incorporate external business logic into Hive scripts and developed join data set scripts using Hive join operations.
- Created self-service reporting in Azure Data Lake Store Gen2 using an ELT approach.
- Built a real-time pipeline for streaming data using Event Hubs/Microsoft Azure Queue and Spark Streaming.
- Extracted and updated the data into HDFS using Sqoop import and export.
- Created and maintained optimal data pipeline architecture in Microsoft Azure using Data Factory and Azure Databricks.
- Delivered denormalized data for Power BI consumers for modeling and visualization from the produced layer in the data lake.
- Developed various Oracle SQL scripts, PL/SQL packages, procedures, functions, and supporting Java code.
- Worked on a clustered Hadoop for Windows Azure using HDInsight and Hortonworks Data Platform for Windows.
- Set up Azure infrastructure such as storage accounts, integration runtimes, service principal IDs, and app registrations to enable scalable and optimized support for business users' analytical requirements in Azure.
- Implemented Kafka producers, created custom partitions, configured brokers, and implemented high-level consumers to build the data platform.
- Worked with Spark to improve performance and optimize existing algorithms in Hadoop using Spark Context, Spark SQL, PySpark, Impala, Tealeaf, pair RDDs, DevOps, and Spark on YARN.
- Exposed transformed data in the Azure Databricks platform in Parquet format for efficient data storage.
- Created Data Factory pipelines that bulk copy multiple tables at once from relational databases to Azure Data Lake Gen2.
- Developed and designed data integration and migration solutions in Azure.
- Responsible for managing data coming from different sources through Kafka.
- Worked with NoSQL databases like HBase in creating HBase tables to load large sets of semi-structured data coming from various sources
- Implemented continuous integration/continuous delivery (CI/CD) best practices using Azure DevOps, ensuring code versioning.
Environment: Hadoop, Spark, MapReduce, Kafka, Docker, Jenkins, Scala, Java, Azure Data Lake Gen2, Azure Data Factory, PySpark, Databricks, Azure DevOps, Agile, Power BI, Python, R, PL/SQL, Oracle 12c, SQL, NoSQL, HBase, Scaled Agile team environment
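A minimal sketch of a vectorized (pandas) UDF of the kind mentioned above; the z-score normalization and column names are illustrative assumptions, not the business logic actually implemented.

```python
# Illustrative vectorized (pandas) UDF in PySpark; logic and column names are hypothetical.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("pandas-udf-sketch").getOrCreate()

@pandas_udf(DoubleType())
def zscore(amount: pd.Series) -> pd.Series:
    # Runs on whole Arrow batches rather than row-at-a-time Python calls.
    # Note: mean/std are computed per batch, which is fine for illustration.
    return (amount - amount.mean()) / amount.std()

df = spark.createDataFrame([(1, 10.0), (2, 12.0), (3, 30.0)], ["id", "amount"])
df.withColumn("amount_z", zscore(F.col("amount"))).show()
```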
Confidential, Boston, MA
Data Engineer
Responsibilities:
- Managed security groups on AWS, focusing on high availability, fault tolerance, and auto scaling using Terraform templates, along with continuous integration and continuous deployment using AWS Lambda and AWS CodePipeline.
- Created data quality scripts using SQL and Hive to validate successful data loads and the quality of the data (see the validation sketch at the end of this section). Created various types of data visualizations using Python and Tableau.
- Wrote various data normalization jobs for new data ingested into Redshift.
- Created various complex SSIS/ETL packages to Extract, Transform and Load data
- Used ZooKeeper to store offsets of messages consumed for a specific topic and partition by a specific consumer group in Kafka.
- Used Kafka features such as distribution, partitioning, and the replicated commit log service for messaging systems by maintaining feeds, and created applications that monitor consumer lag within Apache Kafka clusters.
- Used Hive SQL, Presto SQL, and Spark SQL for ETL jobs, choosing the right technology to get the job done.
- Migrated the on-premises database structure to the Confidential Redshift data warehouse.
- Was responsible for ETL and data validation using SQL Server Integration Services.
- Worked on big data with AWS cloud services: EC2, S3, EMR, and DynamoDB.
- Optimizing and tuning the Redshift environment, enabling queries to perform up to 100x faster for Tableau and SAS Visual Analytics.
- Involved in forward engineering the logical models to generate the physical model using Erwin, with subsequent deployment to the Enterprise Data Warehouse.
- Defined and deployed monitoring, metrics, and logging systems on AWS
- Developed SSRS reports, SSIS packages to Extract, Transform and Load data from various source systems
- Created Entity Relationship Diagrams (ERD), Functional diagrams, Data flow diagrams and enforced referential integrity constraints and created logical and physical models using Erwin.
- Created ad-hoc queries and reports to support business decisions using SQL Server Reporting Services (SSRS).
- Analyzed existing application programs and tuned SQL queries using execution plans, Query Analyzer, SQL Profiler, and Database Engine Tuning Advisor to enhance performance.
Environment: Informatica, RDS, NoSQL, Snowflake Schema, Apache Kafka, Python, ZooKeeper, SQL Server, Erwin, Oracle, Redshift, MySQL, PostgreSQL.
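A hedged sketch, assuming Hive-backed tables queried through Spark SQL, of the kind of post-load data-quality validation described above; the table name (staging.orders) and the checked column are hypothetical.

```python
# Illustrative post-load data-quality checks via Spark SQL; table and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dq-checks").enableHiveSupport().getOrCreate()

table = "staging.orders"  # hypothetical Hive table

# 1. Row count: fail the load if the table arrived empty.
row_count = spark.sql(f"SELECT COUNT(*) AS c FROM {table}").first()["c"]
assert row_count > 0, f"{table} is empty after load"

# 2. Null check on a required key column.
null_keys = spark.sql(
    f"SELECT COUNT(*) AS c FROM {table} WHERE order_id IS NULL"
).first()["c"]
assert null_keys == 0, f"{table} has {null_keys} rows with NULL order_id"

# 3. Duplicate check on the primary key.
dupes = spark.sql(
    f"SELECT COUNT(*) AS c FROM ("
    f"  SELECT order_id FROM {table} GROUP BY order_id HAVING COUNT(*) > 1"
    f") t"
).first()["c"]
assert dupes == 0, f"{table} has {dupes} duplicate order_id values"

print(f"Data-quality checks passed for {table}: {row_count} rows")
```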
Confidential
Hadoop Developer/ Data Engineer
Responsibilities:
- Designed, developed and did maintenance of data integration programs in a Hadoop and RDBMS environment with both traditional and non-traditional source systems.
- Worked extensively on AWS components such as Airflow, Elastic MapReduce (EMR), Athena, and Snowflake.
- Developed Python scripts to find vulnerabilities in SQL queries through SQL injection testing.
- Loaded data from the UNIX file system to HDFS and wrote Hive user-defined functions.
- Used Sqoop to load data from DB2 to HBase for faster querying and performance optimization.
- Developed Hive scripts for implementing dynamic partitions.
- Experience writing Sqoop scripts for importing and exporting data between RDBMS and HDFS.
- Developed Python code for tasks, dependencies, SLA watchers, and time sensors for each job, using Airflow for workflow management and automation (see the DAG sketch at the end of this section).
- Installed and configured Hive, Pig, Sqoop, Flume, and Oozie on the Hadoop cluster.
- Developed Sqoop scripts for loading data into HDFS from DB2 and pre-processed it with Pig.
- Automated the tasks of loading the data into HDFS and pre-processing with Pig by developing workflows using Oozie
- Developed MapReduce jobs in both Pig and Hive for data cleaning and pre-processing.
- Developed a suite of unit test cases for Mapper, Reducer, and Driver classes using a testing library.
- Collected log data from web servers and integrated it into HDFS using Flume.
- Worked on developing ETL workflows on the ingested data using Scala, processing it in HDFS and HBase using Oozie.
- Written ETL jobs to visualize the data and generate reports from MySQL database using DataStage.
Environment: Hadoop, HDFS, Hive, Pig, Flume, MapReduce, AWS, ETL Workflows, HBase, Python, Sqoop, Oozie, DataStage, Linux, Relational Databases, SQL Server, DB2
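A hedged Airflow sketch of the task/dependency/SLA/time-sensor pattern described above; the DAG id, schedule, Sqoop connection string, and script paths are illustrative assumptions (Airflow 2.x import paths).

```python
# Illustrative Airflow DAG: a time sensor gating a Sqoop import with an SLA,
# followed by a Hive load. Names, paths, and the schedule are placeholders.
from datetime import datetime, time, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.sensors.time_sensor import TimeSensor

with DAG(
    dag_id="nightly_ingest",                 # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 2 * * *",           # nightly at 02:00
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:

    # Wait until the upstream extract window is expected to be complete.
    wait_for_window = TimeSensor(task_id="wait_for_window", target_time=time(2, 30))

    # Import task with an SLA so late runs are reported by the SLA watcher.
    sqoop_import = BashOperator(
        task_id="sqoop_import",
        bash_command=(
            "sqoop import --connect jdbc:db2://db2-host/SALES "   # placeholder JDBC URL
            "--table ORDERS --target-dir /data/raw/orders"
        ),
        sla=timedelta(hours=1),
    )

    # Load the imported files into partitioned Hive tables.
    hive_load = BashOperator(
        task_id="hive_load",
        bash_command="hive -f /scripts/load_orders_partitions.hql",  # placeholder script
    )

    wait_for_window >> sqoop_import >> hive_load
```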
Confidential
Hadoop Developer
Responsibilities:
- Installed and configured Hadoop MapReduce and HDFS; developed multiple MapReduce jobs.
- Created Hive tables and worked on them using HiveQL. Experienced in defining job flows.
- Imported and exported data between the Oracle database and HDFS on a regular basis using Sqoop.
- Version Controlled using SVN.
- Installed and configured Pig and wrote Pig Latin scripts.
- Developed applications in the Eclipse IDE; experience developing Spring Boot applications for transformations.
- The custom FileSystem plugin allows Hadoop MapReduce programs, HBase, Pig and Hive to work unmodified and access files directly.
- Wrote MapReduce jobs using Pig Latin (an illustrative map/reduce sketch follows this section's environment list). Involved in ETL, data integration, and migration.
- Built and deployed WAR files on the WebSphere application server.
Environment: Hadoop, MapReduce, HDFS, Hive, Java, Hadoop distribution of Cloudera, Pig, HBase, Oracle, Toad, MS Office, MS Excel.
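The MapReduce jobs in this role were written in Pig Latin; purely as an illustration of the underlying map/reduce cleaning-and-counting pattern, here is a hedged Hadoop Streaming pair in Python with a hypothetical tab-delimited log layout.

```python
# mapper.py -- Hadoop Streaming mapper: emit (page, 1) for well-formed log
# lines and drop malformed records (the 3-column layout is hypothetical).
import sys

for line in sys.stdin:
    parts = line.rstrip("\n").split("\t")
    if len(parts) < 3:      # data cleaning: skip malformed rows
        continue
    page = parts[2].strip()
    if page:
        print(f"{page}\t1")
```

```python
# reducer.py -- sums the counts per page; Hadoop Streaming delivers keys sorted.
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{total}")
        current_key, total = key, 0
    total += int(value)
if current_key is not None:
    print(f"{current_key}\t{total}")
```

The pair would be submitted with the standard hadoop-streaming jar (passing -mapper mapper.py and -reducer reducer.py), with the input and output HDFS paths supplied at run time.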